Project Process :
This project followed the CRISP-DM methodology throughout, except for the final phase of deployment. The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
Published in 1999 to standardize data mining processes across industries, it has since become the most common methodology for data mining, analytics, and data science projects.
Dataset Context Assumptions :
The dataset provided serves as a repository of past historical records, documenting the interactions and financial behaviors of loan applicants within the MoneyLion ecosystem. Here's a detailed elaboration on the assumptions surrounding the dataset's integrity and reliability:
Historical Context :
Assumption of Integrity :
Assumption of Reliability :
Implications for Analysis :
Continuous Monitoring and Validation :
At MoneyLion, our ongoing commitment revolves around enhancing the accuracy of assessing applicant risk. The primary objective of our endeavor is to reliably predict the risk and repayment quality of loans for individual applicants.
Purpose : This assessment's central focus lies in predicting the risk associated with loan repayment for each applicant. By comprehensively evaluating the factors contributing to loan quality, it aims to support informed decisions regarding applicants' eligibility for approval.
Business Context Process Assumptions :
Client Loan Application Submission :
Application Details Sent to Default Approval Team :
Prediction Result from Default Approval Team :
Loan Approval Decision by MoneyLion :
My ideal scenario envisions the integration of the predictive model between steps 2 and 3 in the loan approval process. During this pivotal stage, the model meticulously evaluates the risk associated with each loan application, serving as the final checkpoint before granting approval.
Significance of Model Placement : Positioning the predictive model between steps 2 and 3 holds strategic importance. In step 2, a significant portion of potentially fraudulent applications is efficiently filtered out, allowing us to concentrate our efforts on in-depth risk assessment. By integrating the model at this juncture, we enhance our ability to discern the nuances of genuine applications, identifying subtle patterns and indicators that impact repayment potential.
Focus on Risk Assessment :
The primary focus of my model lies in assessing the risk associated with individual applications. By meticulously analyzing various factors, including apr, payFrequency, and other relevant variables, the model provides a nuanced understanding of each applicant's risk profile. This granular evaluation enables me to make highly informed decisions, aligning the loan approval process with MoneyLion's specific risk tolerance thresholds.
Enabling Informed Business Decisions : My strategic approach aims to empower MoneyLion with a streamlined method of evaluating application risk tailored to its own risk tolerance criteria. By leveraging the insights generated by the predictive model, MoneyLion gains a comprehensive view of each applicant's creditworthiness and repayment potential. This clarity opens avenues for well-informed, data-driven business decisions, fostering a lending environment characterized by prudence and financial acumen.
In essence, integrating the predictive model in this manner not only fortifies the defense against fraudulent applications but also equips MoneyLion with the tools to make astute lending choices. By aligning the risk assessment with its unique risk tolerance parameters, it paves the way for a secure and informed lending landscape at MoneyLion.
# Time measurement
import time
st = time.time()
## Library
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import plotly.offline as py
from scipy.stats import kurtosis, skew
import statistics as stat
from utils import *
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import make_scorer, accuracy_score, f1_score
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neural_network import MLPClassifier
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
import shap
# Ignore warnings globally so library deprecation notices don't clutter the output
warnings.filterwarnings("ignore")
# Always good to set a seed for reproducibility
SEED = 7
np.random.seed(SEED)
There are 3 datasets given, namely loan.csv, payment.csv and clarity_underwriting_variables.csv.
# For simplicity, we use the names df1, df2 and df3 to represent the loan,
# payment and clarity_underwriting_variables datasets respectively.
df1 = pd.read_csv('loan.csv',parse_dates=['applicationDate', 'originatedDate'])
df2 = pd.read_csv('payment.csv')
df3 = pd.read_csv('clarity_underwriting_variables.csv',low_memory=False)
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 577682 entries, 0 to 577681
Data columns (total 19 columns):
 #   Column                            Non-Null Count   Dtype
---  ------                            --------------   -----
 0   loanId                            577426 non-null  object
 1   anon_ssn                          577682 non-null  object
 2   payFrequency                      576409 non-null  object
 3   apr                               573760 non-null  float64
 4   applicationDate                   577682 non-null  datetime64[ns]
 5   originated                        577682 non-null  bool
 6   originatedDate                    46044 non-null   datetime64[ns]
 7   nPaidOff                          577658 non-null  float64
 8   approved                          577682 non-null  bool
 9   isFunded                          577682 non-null  int64
 10  loanStatus                        577291 non-null  object
 11  loanAmount                        575432 non-null  float64
 12  originallyScheduledPaymentAmount  577682 non-null  float64
 13  state                             577550 non-null  object
 14  leadType                          577682 non-null  object
 15  leadCost                          577682 non-null  int64
 16  fpStatus                          51723 non-null   object
 17  clarityFraudId                    357693 non-null  object
 18  hasCF                             577682 non-null  int64
dtypes: bool(2), datetime64[ns](2), float64(4), int64(3), object(8)
memory usage: 76.0+ MB
df1.head(15)
| loanId | anon_ssn | payFrequency | apr | applicationDate | originated | originatedDate | nPaidOff | approved | isFunded | loanStatus | loanAmount | originallyScheduledPaymentAmount | state | leadType | leadCost | fpStatus | clarityFraudId | hasCF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LL-I-07399092 | beff4989be82aab4a5b47679216942fd | B | 360.00 | 2016-02-23 17:29:01.940 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 500.0 | 978.27 | IL | bvMandatory | 6 | NaN | 5669ef78e4b0c9d3936440e6 | 1 |
| 1 | LL-I-06644937 | 464f5d9ae4fa09ece4048d949191865c | B | 199.00 | 2016-01-19 22:07:36.778 | True | 2016-01-20 15:49:18.846 | 0.0 | True | 1 | Paid Off Loan | 3000.0 | 6395.19 | CA | prescreen | 0 | Checked | 569eb3a3e4b096699f685d64 | 1 |
| 2 | LL-I-10707532 | 3c174ae9e2505a5f9ddbff9843281845 | B | 590.00 | 2016-08-01 13:51:14.709 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 400.0 | 1199.45 | MO | bvMandatory | 3 | NaN | 579eab11e4b0d0502870ef2f | 1 |
| 3 | LL-I-02272596 | 9be6f443bb97db7e95fa0c281d34da91 | B | 360.00 | 2015-08-06 23:58:08.880 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 500.0 | 1074.05 | IL | bvMandatory | 3 | NaN | 555b1e95e4b0f6f11b267c18 | 1 |
| 4 | LL-I-09542882 | 63b5494f60b5c19c827c7b068443752c | B | 590.00 | 2016-06-05 22:31:34.304 | False | NaT | 0.0 | False | 0 | Rejected | 350.0 | 814.37 | NV | bvMandatory | 3 | NaN | 5754a91be4b0c6a2bf424772 | 1 |
| 5 | LL-I-09734486 | b5541f49472fa0fce8e473306768f7fb | M | 650.00 | 2016-06-12 19:27:47.951 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 300.0 | 738.18 | IN | organic | 0 | NaN | 574e2029e4b061d2c3a16a4c | 1 |
| 6 | LL-I-15006968 | 1828c64bb2dffeae88b27174a9f79ecc | B | 680.00 | 2017-01-12 18:04:37.921 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 400.0 | 1362.92 | TX | lead | 25 | NaN | 5877c533e4b08f2480ab7e5a | 1 |
| 7 | LL-I-08327946 | 02596517e7633c7e87e6b333a0fb1bbe | M | 449.99 | 2016-04-01 22:13:02.131 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 350.0 | 759.84 | UT | bvMandatory | 3 | NaN | NaN | 0 |
| 8 | LL-I-10568307 | 523ed92e1145eb688bb631da24695197 | S | 251.00 | 2016-07-27 00:19:52.808 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 2600.0 | 8230.01 | CA | california | 40 | NaN | 5797fe22e4b0d05020f298ce | 1 |
| 9 | LL-I-05518977 | 47bf79119075e41ef65510f2900c8e4a | B | 360.00 | 2015-12-09 18:17:33.622 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 500.0 | 995.22 | IL | bvMandatory | 6 | NaN | 563281d2e4b07887adc5bc0a | 1 |
| 10 | LL-I-12004066 | 15340aa86b89972ac3c1572e2053e107 | B | 590.00 | 2016-09-14 17:49:10.712 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 500.0 | 1457.26 | CA | bvMandatory | 10 | NaN | NaN | 0 |
| 11 | LL-I-16484048 | d7fa2d3cdff861249f57644e1050334a | B | 645.00 | 2017-01-30 22:40:47.965 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 700.0 | 2115.90 | FL | organic | 0 | NaN | 588fa80ae4b08f24a3dd6348 | 1 |
| 12 | LL-I-06936557 | 2f90cd4d32d373e4c044a67fc4b9cb88 | I | 590.00 | 2016-02-01 01:56:09.600 | False | NaT | 0.0 | False | 0 | Rejected | 300.0 | 856.50 | TN | bvMandatory | 3 | NaN | 56aebd69e4b0138a4b2d8086 | 1 |
| 13 | LL-I-08740673 | 50e168ad9d93f7ee73e5f041822bd8dd | B | 590.00 | 2016-04-15 16:35:01.146 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 800.0 | 2594.33 | MI | bvMandatory | 10 | NaN | 57112235e4b0cd575d778568 | 1 |
| 14 | LL-I-13956713 | 8f3bd2f82d48d6e3b221247be473d2e9 | W | 645.00 | 2016-12-27 12:09:22.807 | False | NaT | 0.0 | False | 0 | Rejected | 600.0 | 2056.98 | RI | bvMandatory | 3 | NaN | 58625b73e4b08f246a81bad8 | 1 |
df1.describe()
| apr | nPaidOff | isFunded | loanAmount | originallyScheduledPaymentAmount | leadCost | hasCF | |
|---|---|---|---|---|---|---|---|
| count | 573760.000000 | 577658.000000 | 577682.000000 | 575432.000000 | 577682.000000 | 577682.000000 | 577682.000000 |
| mean | 553.080972 | 0.037887 | 0.067480 | 514.245084 | 1428.897209 | 7.854389 | 0.619187 |
| std | 110.046159 | 0.333366 | 0.250852 | 320.939929 | 925.009141 | 12.853451 | 0.485587 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -816.710000 | 0.000000 | 0.000000 |
| 25% | 490.000000 | 0.000000 | 0.000000 | 350.000000 | 1023.640000 | 3.000000 | 0.000000 |
| 50% | 590.000000 | 0.000000 | 0.000000 | 500.000000 | 1245.250000 | 3.000000 | 1.000000 |
| 75% | 601.000000 | 0.000000 | 0.000000 | 500.000000 | 1615.660000 | 6.000000 | 1.000000 |
| max | 705.590000 | 21.000000 | 1.000000 | 5000.000000 | 19963.630000 | 200.000000 | 1.000000 |
In this file there are 19 columns:
Every row represents a loan application (not every application was approved or successfully funded). Missing values can exist. Some fields are only populated after the loan application was made.
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689364 entries, 0 to 689363
Data columns (total 9 columns):
 #  Column             Non-Null Count   Dtype
--- ------             --------------   -----
 0  loanId             689364 non-null  object
 1  installmentIndex   689364 non-null  int64
 2  isCollection       689364 non-null  bool
 3  paymentDate        689364 non-null  object
 4  principal          689364 non-null  float64
 5  fees               689364 non-null  float64
 6  paymentAmount      689364 non-null  float64
 7  paymentStatus      689364 non-null  object
 8  paymentReturnCode  31533 non-null   object
dtypes: bool(1), float64(3), int64(1), object(4)
memory usage: 42.7+ MB
df2.head(15)
| loanId | installmentIndex | isCollection | paymentDate | principal | fees | paymentAmount | paymentStatus | paymentReturnCode | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LL-I-00000021 | 1 | False | 2014-12-19T05:00:00 | 22.33 | 147.28 | 169.61 | Checked | NaN |
| 1 | LL-I-00000021 | 2 | False | 2015-01-02T05:00:00 | 26.44 | 143.17 | 169.61 | Checked | NaN |
| 2 | LL-I-00000021 | 3 | False | 2015-01-16T05:00:00 | 31.30 | 138.31 | 169.61 | Checked | NaN |
| 3 | LL-I-00000021 | 4 | False | 2015-01-30T05:00:00 | 37.07 | 132.54 | 169.61 | Checked | NaN |
| 4 | LL-I-00000021 | 5 | False | 2015-02-13T05:00:00 | 43.89 | 125.72 | 169.61 | Checked | NaN |
| 5 | LL-I-00000021 | 6 | False | 2015-02-27T05:00:00 | 51.97 | 117.64 | 169.61 | Checked | NaN |
| 6 | LL-I-00000021 | 7 | False | 2015-03-13T04:00:00 | 61.54 | 108.07 | 169.61 | Checked | NaN |
| 7 | LL-I-00000021 | 8 | False | 2015-03-27T04:00:00 | 72.87 | 96.74 | 169.61 | Checked | NaN |
| 8 | LL-I-00000021 | 9 | False | 2015-04-10T04:00:00 | 86.29 | 83.32 | 169.61 | Checked | NaN |
| 9 | LL-I-00000021 | 10 | False | 2015-04-24T04:00:00 | 102.17 | 67.44 | 169.61 | Rejected | R01 |
| 10 | LL-I-00000021 | 11 | False | 2015-05-08T04:00:00 | 120.98 | 48.63 | 169.61 | Checked | NaN |
| 11 | LL-I-00000021 | 12 | False | 2015-05-22T04:00:00 | 143.15 | 26.35 | 169.50 | Checked | NaN |
| 12 | LL-I-00000021 | 1 | True | 2016-04-28T22:44:43.736000 | 0.01 | 0.00 | 0.01 | Checked | NaN |
| 13 | LL-I-00000231 | 1 | False | 2014-12-11T05:00:00 | 0.00 | 84.16 | 84.16 | Checked | NaN |
| 14 | LL-I-00000231 | 2 | False | 2014-12-26T05:00:00 | 27.20 | 147.28 | 174.48 | Checked | NaN |
df2.describe()
| installmentIndex | principal | fees | paymentAmount | |
|---|---|---|---|---|
| count | 689364.000000 | 689364.000000 | 689364.000000 | 689364.000000 |
| mean | 10.553222 | 45.557543 | 67.003994 | 112.680232 |
| std | 8.049530 | 81.724683 | 59.789510 | 105.783710 |
| min | 1.000000 | -303.370000 | -42.560000 | -337.700000 |
| 25% | 5.000000 | 13.180000 | 28.820000 | 56.810000 |
| 50% | 9.000000 | 27.610000 | 51.300000 | 86.340000 |
| 75% | 14.000000 | 53.380000 | 86.440000 | 135.090000 |
| max | 105.000000 | 4000.000000 | 1257.710000 | 4063.600000 |
There are 9 columns in this file:
Each row in this file represents an ACH attempt (either scheduled for the future or already elapsed) associated with a loan.
df3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49752 entries, 0 to 49751
Data columns (total 54 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   .underwritingdataclarity.clearfraud.clearfraudinquiry.thirtydaysago  49750 non-null  float64
 1   .underwritingdataclarity.clearfraud.clearfraudinquiry.twentyfourhoursago  49750 non-null  float64
 2   .underwritingdataclarity.clearfraud.clearfraudinquiry.oneminuteago  49750 non-null  float64
 3   .underwritingdataclarity.clearfraud.clearfraudinquiry.onehourago  49750 non-null  float64
 4   .underwritingdataclarity.clearfraud.clearfraudinquiry.ninetydaysago  49750 non-null  float64
 5   .underwritingdataclarity.clearfraud.clearfraudinquiry.sevendaysago  49750 non-null  float64
 6   .underwritingdataclarity.clearfraud.clearfraudinquiry.tenminutesago  49750 non-null  float64
 7   .underwritingdataclarity.clearfraud.clearfraudinquiry.fifteendaysago  49750 non-null  float64
 8   .underwritingdataclarity.clearfraud.clearfraudinquiry.threesixtyfivedaysago  49750 non-null  float64
 9   .underwritingdataclarity.clearfraud.clearfraudindicator.inquiryonfilecurrentaddressconflict  49712 non-null  object
 10  .underwritingdataclarity.clearfraud.clearfraudindicator.totalnumberoffraudindicators  49735 non-null  float64
 11  .underwritingdataclarity.clearfraud.clearfraudindicator.telephonenumberinconsistentwithaddress  49712 non-null  object
 12  .underwritingdataclarity.clearfraud.clearfraudindicator.inquiryageyoungerthanssnissuedate  49712 non-null  object
 13  .underwritingdataclarity.clearfraud.clearfraudindicator.onfileaddresscautious  49712 non-null  object
 14  .underwritingdataclarity.clearfraud.clearfraudindicator.inquiryaddressnonresidential  49712 non-null  object
 15  .underwritingdataclarity.clearfraud.clearfraudindicator.onfileaddresshighrisk  49712 non-null  object
 16  .underwritingdataclarity.clearfraud.clearfraudindicator.ssnreportedmorefrequentlyforanother  49712 non-null  object
 17  .underwritingdataclarity.clearfraud.clearfraudindicator.currentaddressreportedbytradeopenlt90days  49712 non-null  object
 18  .underwritingdataclarity.clearfraud.clearfraudindicator.inputssninvalid  49712 non-null  object
 19  .underwritingdataclarity.clearfraud.clearfraudindicator.inputssnissuedatecannotbeverified  49712 non-null  object
 20  .underwritingdataclarity.clearfraud.clearfraudindicator.inquiryaddresscautious  49712 non-null  object
 21  .underwritingdataclarity.clearfraud.clearfraudindicator.morethan3inquiriesinthelast30days  49712 non-null  object
 22  .underwritingdataclarity.clearfraud.clearfraudindicator.onfileaddressnonresidential  49712 non-null  object
 23  .underwritingdataclarity.clearfraud.clearfraudindicator.creditestablishedpriortossnissuedate  49712 non-null  object
 24  .underwritingdataclarity.clearfraud.clearfraudindicator.driverlicenseformatinvalid  44703 non-null  object
 25  .underwritingdataclarity.clearfraud.clearfraudindicator.inputssnrecordedasdeceased  49712 non-null  object
 26  .underwritingdataclarity.clearfraud.clearfraudindicator.inquiryaddresshighrisk  49712 non-null  object
 27  .underwritingdataclarity.clearfraud.clearfraudindicator.inquirycurrentaddressnotonfile  49712 non-null  object
 28  .underwritingdataclarity.clearfraud.clearfraudindicator.bestonfilessnissuedatecannotbeverified  49712 non-null  object
 29  .underwritingdataclarity.clearfraud.clearfraudindicator.highprobabilityssnbelongstoanother  49712 non-null  object
 30  .underwritingdataclarity.clearfraud.clearfraudindicator.maxnumberofssnswithanybankaccount  49735 non-null  float64
 31  .underwritingdataclarity.clearfraud.clearfraudindicator.bestonfilessnrecordedasdeceased  49712 non-null  object
 32  .underwritingdataclarity.clearfraud.clearfraudindicator.currentaddressreportedbynewtradeonly  49712 non-null  object
 33  .underwritingdataclarity.clearfraud.clearfraudindicator.creditestablishedbeforeage18  49712 non-null  object
 34  .underwritingdataclarity.clearfraud.clearfraudindicator.telephonenumberinconsistentwithstate  49071 non-null  object
 35  .underwritingdataclarity.clearfraud.clearfraudindicator.driverlicenseinconsistentwithonfile  10055 non-null  object
 36  .underwritingdataclarity.clearfraud.clearfraudindicator.workphonepreviouslylistedascellphone  21416 non-null  object
 37  .underwritingdataclarity.clearfraud.clearfraudindicator.workphonepreviouslylistedashomephone  21416 non-null  object
 38  .underwritingdataclarity.clearfraud.clearfraudidentityverification.ssnnamematch  49720 non-null  object
 39  .underwritingdataclarity.clearfraud.clearfraudidentityverification.nameaddressmatch  49720 non-null  object
 40  .underwritingdataclarity.clearfraud.clearfraudidentityverification.phonematchtype  48799 non-null  object
 41  .underwritingdataclarity.clearfraud.clearfraudidentityverification.ssnnamereasoncodedescription  2669 non-null  object
 42  .underwritingdataclarity.clearfraud.clearfraudidentityverification.phonematchresult  49712 non-null  object
 43  .underwritingdataclarity.clearfraud.clearfraudidentityverification.nameaddressreasoncodedescription  5627 non-null  object
 44  .underwritingdataclarity.clearfraud.clearfraudidentityverification.phonematchtypedescription  48799 non-null  object
 45  .underwritingdataclarity.clearfraud.clearfraudidentityverification.overallmatchresult  49720 non-null  object
 46  .underwritingdataclarity.clearfraud.clearfraudidentityverification.phonetype  1515 non-null  object
 47  .underwritingdataclarity.clearfraud.clearfraudidentityverification.ssndobreasoncode  9029 non-null  object
 48  .underwritingdataclarity.clearfraud.clearfraudidentityverification.ssnnamereasoncode  2669 non-null  object
 49  .underwritingdataclarity.clearfraud.clearfraudidentityverification.nameaddressreasoncode  5627 non-null  object
 50  .underwritingdataclarity.clearfraud.clearfraudidentityverification.ssndobmatch  49720 non-null  object
 51  .underwritingdataclarity.clearfraud.clearfraudidentityverification.overallmatchreasoncode  49720 non-null  float64
 52  clearfraudscore  49615 non-null  float64
 53  underwritingid  49752 non-null  object
dtypes: float64(13), object(41)
memory usage: 20.5+ MB
A1. Payment Dataset Decision
The payment dataset lacks relevance for our prediction because payments are made only after a loan is approved. At the point where approval is still pending, payment data would not be available, so it is not utilized in our modeling stage and we opt not to include it in our analysis.
A2. Selecting only the ClearFraudScore Column
In analyzing the clarity_underwriting_variables dataset, a strategic decision was made to focus specifically on the clearfraudscore column, on the assumption that the important signals from the other underwriting variables are summarized in this single score.
A3. Filtering by isFunded Column
In the process of data transformation, one of the crucial steps involves filtering the dataset based on the isFunded column, focusing specifically on the instances where the value is '1'. Here’s a detailed explanation of why this filtering is essential and how it impacts the dataset:
Purpose of Filtering:
Importance of Focusing on Funded Loans:
Refined Risk Evaluation:
Strategic Decision-Making:
A4. Dropping the 'clarityFraudId' and 'underwritingid' Column
Of the identifier columns from loan.csv and clarity_underwriting_variables.csv, we only need to maintain a primary key (the loanId variable) for the merged dataset.
# Only maintain 'underwritingid' and 'clearfraudscore' from df3
df3 = df3[['underwritingid', 'clearfraudscore']]
# Merge df1 and df3 (on clarityFraudId / underwritingid) as merged_df1;
# this is an inner join by default, so applications without a Clarity record are dropped
merged_df1 = pd.merge(df1, df3, left_on='clarityFraudId', right_on='underwritingid')
# Keep only funded loans (isFunded is an int64 flag)
temp = merged_df1[merged_df1['isFunded'] == 1]
filter_data1 = temp.drop(['clarityFraudId','underwritingid'],axis = 1)
filter_data1.head()
| loanId | anon_ssn | payFrequency | apr | applicationDate | originated | originatedDate | nPaidOff | approved | isFunded | loanStatus | loanAmount | originallyScheduledPaymentAmount | state | leadType | leadCost | fpStatus | hasCF | clearfraudscore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LL-I-06644937 | 464f5d9ae4fa09ece4048d949191865c | B | 199.0 | 2016-01-19 22:07:36.778 | True | 2016-01-20 15:49:18.846 | 0.0 | True | 1 | Paid Off Loan | 3000.0 | 6395.19 | CA | prescreen | 0 | Checked | 1 | 879.0 |
| 1 | LL-I-17826953 | 7529d29c9d943cc55bc71a85d9c6b7a3 | W | 645.0 | 2017-03-01 15:55:41.823 | True | 2017-03-01 16:15:13.016 | 0.0 | True | 1 | New Loan | 400.0 | 1364.87 | NJ | lead | 40 | Checked | 1 | 658.0 |
| 2 | LL-I-16519347 | bc9fa73468acf18cb968a3a6a3c4697c | S | 490.0 | 2017-01-31 17:48:17.381 | True | 2017-01-31 18:00:14.520 | 2.0 | True | 1 | Paid Off Loan | 1250.0 | 3313.78 | MO | organic | 0 | Checked | 1 | 835.0 |
| 3 | LL-I-17491287 | bc9fa73468acf18cb968a3a6a3c4697c | S | 490.0 | 2017-02-21 17:20:23.671 | True | 2017-02-21 17:45:11.088 | 3.0 | True | 1 | Paid Off Loan | 1250.0 | 3112.16 | MO | organic | 0 | Checked | 1 | 835.0 |
| 4 | LL-I-15679301 | bc9fa73468acf18cb968a3a6a3c4697c | S | 490.0 | 2017-01-18 01:16:24.941 | True | 2017-01-18 01:30:16.712 | 1.0 | True | 1 | Paid Off Loan | 500.0 | 1292.09 | MO | organic | 0 | Checked | 1 | 835.0 |
filter_data1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32180 entries, 0 to 36333
Data columns (total 19 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   loanId                            32180 non-null  object
 1   anon_ssn                          32180 non-null  object
 2   payFrequency                      32180 non-null  object
 3   apr                               32180 non-null  float64
 4   applicationDate                   32180 non-null  datetime64[ns]
 5   originated                        32180 non-null  bool
 6   originatedDate                    32180 non-null  datetime64[ns]
 7   nPaidOff                          32178 non-null  float64
 8   approved                          32180 non-null  bool
 9   isFunded                          32180 non-null  int64
 10  loanStatus                        32180 non-null  object
 11  loanAmount                        32180 non-null  float64
 12  originallyScheduledPaymentAmount  32180 non-null  float64
 13  state                             32180 non-null  object
 14  leadType                          32180 non-null  object
 15  leadCost                          32180 non-null  int64
 16  fpStatus                          32175 non-null  object
 17  hasCF                             32180 non-null  int64
 18  clearfraudscore                   32088 non-null  float64
dtypes: bool(2), datetime64[ns](2), float64(5), int64(3), object(7)
memory usage: 4.5+ MB
The observations from the summary of the merged dataset indicate the presence of missing values in several key variables, notably in fpStatus, nPaidOff, and clearfraudscore.
Addressing missing values systematically is crucial to maintain data quality, ensuring that subsequent analyses and interpretations are reliable and accurate. By understanding the implications of missing data and employing appropriate strategies, researchers can mitigate the impact of missing values on their analyses and conclusions.
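One way to carry out such a strategy is sketched below on a toy frame; the values are invented and median imputation is just one of several reasonable options (dropping rows or model-based imputation are alternatives):

```python
import pandas as pd
import numpy as np

# Toy frame with gaps mimicking fpStatus / clearfraudscore style missingness
toy = pd.DataFrame({
    "fpStatus": ["Checked", None, "Rejected", None],
    "clearfraudscore": [879.0, np.nan, 835.0, 658.0],
})

# Per-column missing counts guide the choice of imputation strategy
missing = toy.isna().sum()
print(missing)

# One common option: median imputation for a numeric score column
toy["clearfraudscore"] = toy["clearfraudscore"].fillna(toy["clearfraudscore"].median())
print(toy["clearfraudscore"].isna().sum())  # 0 after imputation
```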
print("Number of rows :",len(filter_data1))
print("Number of columns :",len(filter_data1.columns))
Number of rows : 32180 Number of columns : 19
When analyzing a dataset, each potential variable (excluding primary keys, as they are unique identifiers and not typically analyzed for insights) should be individually examined. This process involves creating visualizations such as bar charts and box plots to gain insights into the data. Let's break down the steps for this analysis:
Variable Identification: Start by identifying the variables in the merged dataset that we want to analyze. These can include numerical variables (representing quantitative data) and categorical variables (representing qualitative data).
Understanding Categorical Variables:
For categorical variables (such as originated, or approved), using bar charts is a common practice. A bar chart displays the count or frequency of each category in the variable. By creating a bar chart, we can quickly identify which categories are more common or rare in the dataset. This can reveal patterns and imbalances within the data. For instance, if we're analyzing a dataset containing information about the loan approval status, a bar chart can show the distribution of the different current statuses of loan applications.
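The counts behind such a bar chart come from a simple frequency tabulation; a minimal sketch on invented payFrequency-style codes:

```python
import pandas as pd

# Toy categorical column mimicking payFrequency codes (B = biweekly, W = weekly, S = semimonthly)
freq = pd.Series(["B", "B", "W", "B", "S", "W"])

# value_counts gives the per-category frequencies a bar chart would display
counts = freq.value_counts()
print(counts)
# counts.plot(kind="bar") would render the bar chart itself
```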
Analyzing Numerical Variables:
For numerical variables (such as apr, or loanAmount), box plots are often used. A box plot provides a summary of the variable's distribution, showing the minimum, first quartile (Q1), median (second quartile or Q2), third quartile (Q3), and maximum values. Additionally, box plots help identify potential outliers in the data. Outliers are data points that significantly deviate from the rest of the data and can provide valuable insights into unusual or exceptional cases.
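The five-number summary and the conventional 1.5×IQR outlier fences a box plot draws can be computed directly; the sample below is invented, not drawn from the apr column:

```python
import numpy as np

# Toy numeric sample standing in for apr values (illustrative only)
apr = np.array([199.0, 360.0, 490.0, 590.0, 601.0, 645.0, 705.0])

# The quartiles a box plot draws
q1, median, q3 = np.percentile(apr, [25, 50, 75])
iqr = q3 - q1

# Conventional 1.5 * IQR fences used to flag outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = apr[(apr < lower) | (apr > upper)]
print(q1, median, q3, outliers.size)
```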
Identifying Patterns and Outliers:
For instance, for the paymentAmount variable, a box plot can indicate whether there are exceptionally high or unusually low repayment amounts relative to the rest of the applicants.
plot_discrete_variable(filter_data1, 'payFrequency',0)
Observations : The bar plot provides valuable insights into the preferred loan repayment frequencies among applicants. It is evident from the data that the majority of applicants favor repaying their loans on a biweekly and weekly basis.
Analysis : Understanding the repayment preferences of applicants is essential for MoneyLion when designing loan products and repayment plans. The popularity of biweekly and weekly repayment schedules suggests that these frequencies are widely accepted and convenient for most applicants. MoneyLion can capitalize on this information by offering loan products with repayment schedules aligned with these preferred frequencies. Doing so enhances customer satisfaction and ensures a more seamless loan repayment process.
Additionally, acknowledging these preferences allows MoneyLion to tailor their communication and outreach strategies. By highlighting the availability of biweekly and weekly repayment options, we can attract more applicants who find these schedules compatible with their financial planning. Furthermore, emphasizing these popular repayment frequencies can also be a key marketing point, attracting potential applicants who prioritize flexibility and convenience in loan repayment. Overall, aligning loan products with applicant preferences enhances customer engagement and fosters a positive relationship between applicants and MoneyLion.
plot_box_plot(filter_data1, 'apr')
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*- Mean : 532.511965382225 Var : 16543.310124476895 Skew : -1.01635240091674 Kurt : -0.006638147194099719 -*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
Observations : The box plot reveals valuable insights about the Apr variable in the dataset. One notable observation is the presence of a very low number of outliers, indicating that there are only a few applicants with exceptionally low Apr values. Additionally, the variable exhibits a high variance, indicating a significant range of Apr values in the dataset, emphasizing the diverse nature of the APR rates provided.
Furthermore, the negative skewness value of -1.02 suggests that the distribution of Apr values is skewed to the left. This implies that a majority of applicants have higher Apr rates, with a tail stretching towards the lower Apr values. The presence of negative skewness often signifies an asymmetrical distribution with an extended tail on the left side.
Additionally, the kurtosis value of -0.007 is essentially zero, indicating that the distribution of Apr values has tails and a peak close to those of a normal distribution. A strongly negative excess kurtosis would signify lighter tails and a flatter peak than normal, while a strongly positive value would signify heavier tails and a sharper peak; the value observed here is too close to zero to support either conclusion.
Analysis : Understanding the statistical properties of the Apr variable is crucial for financial analysis and decision-making. The presence of outliers suggests the need for further investigation into the applicants with exceptionally high or low APR rates, as these cases might require special attention or consideration in financial planning.
Moreover, the high variance and negative skewness indicate the non-uniform distribution of APR rates among applicants. MoneyLion should be aware of this variability while setting APR rates for new applicants. Tailoring financial products and services to accommodate this diverse range of APR rates is essential to meet the needs of a wide array of customers effectively.
Additionally, this knowledge enables financial analysts to make informed predictions and assess the risk associated with different APR rates, ultimately guiding strategic decision-making processes within the company.
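The skewness and kurtosis figures printed above presumably come from the scipy functions imported at the top of the notebook; a minimal sketch on an invented left-skewed sample:

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Toy sample with a long lower tail, loosely echoing the left-skewed apr shape
x = np.array([100.0, 500.0, 580.0, 590.0, 600.0, 605.0, 610.0])

# scipy's kurtosis reports *excess* kurtosis by default (normal distribution -> 0),
# so a value near 0 means roughly normal tails rather than heavy ones
s, k = skew(x), kurtosis(x)
print(s, k)  # s is negative (left skew); k is positive here due to the heavy lower tail
```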
plot_time_frame(filter_data1, 'applicationDate')
plot_time_series(filter_data1,'applicationDate')
Observations : The detailed analysis of the applicationDate variable, considering various time-related components, provides significant insights into the loan application patterns. Here's a breakdown of the observations made from the variable plotting:
Yearly Patterns
The yearly plot of applicationDate suggests an increase in the number of applications over time. This pattern could be influenced by various factors, and understanding the trend is crucial for making informed decisions. It's essential to dive deeper into the data and possibly conduct further analysis to identify the root causes of the upward trend.
Monthly Peaks
Daily Fluctuations
Day of Week Preferences
Quarterly Trends
Analysis : Understanding these temporal patterns allows MoneyLion to optimize its marketing strategies, enhance customer engagement during peak periods, and tailor its lending services to meet the specific needs of applicants, especially during festival seasons and strategic quarters of the year. By aligning their services with these observed trends, MoneyLion can effectively cater to the fluctuating demands of applicants throughout the calendar year.
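The yearly, monthly, daily, day-of-week, and quarterly groupings discussed above rely on pandas' `.dt` accessor; a minimal sketch on invented timestamps:

```python
import pandas as pd

# Toy timestamps standing in for applicationDate values
dates = pd.Series(pd.to_datetime([
    "2016-02-23 17:29:01", "2016-08-01 13:51:14", "2017-01-12 18:04:37",
]))

# The .dt accessor exposes the components the temporal plots group by
print(dates.dt.year.tolist())     # [2016, 2016, 2017]
print(dates.dt.month.tolist())    # [2, 8, 1]
print(dates.dt.day_name().tolist())
print(dates.dt.quarter.tolist())  # [1, 3, 1]
```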
plot_time_frame(filter_data1, 'originatedDate')
plot_time_series(filter_data1,'originatedDate')
Observations : The analysis of the originatedDate variable, considering various time-related components, offers valuable insights that align with the patterns observed in the applicationDate variable. Here's an elaboration on the observations made from the variable analysis:
Yearly Patterns : The yearly plot of originatedDate mirrors the patterns observed in the applicationDate variable. This alignment reaffirms the logical sequence of loan origination following the submission of loan applications. It supports the understanding that loan origination dates naturally follow the dates of application submissions, highlighting the inherent chronological flow in the lending process.
Weekly Patterns : Unlike the applicationDate variable, the originatedDate variable shows a higher number of loans being released on Sundays. This divergence from the typical workweek days (Monday to Friday) suggests a unique operational process or business strategy: MoneyLion may release loan funds on Sundays to avoid any potential reduction in customer satisfaction.
Implications for Decision Making :
Alignment with Business Logic : The fact that originatedDate follows the same sequential pattern as applicationDate validates the logical flow of the lending process. It affirms that the origination of loans logically follows the submission and approval of loan applications, reinforcing the credibility and integrity of the dataset's temporal information.
plot_discrete_variable(filter_data1, 'nPaidOff',0)
Observations : This variable provides insights into the loan repayment history of MoneyLion clients. The data indicates that a significant majority of applicants, totaling 28,295 observations, have not paid off any loans in the past.
Analysis : This observation raises important questions about the reasons behind this trend. Understanding why such a large number of applicants haven't paid off previous loans is crucial for making informed decisions moving forward. Possible factors contributing to this scenario could include financial challenges, economic conditions, or issues related to the loan terms and repayment structures. Investigating these factors in detail is essential before making any further decisions, as it can inform strategies to support applicants in managing their financial obligations and improve the overall loan repayment rates. Additionally, addressing the underlying causes can lead to the development of tailored financial solutions and support systems, enhancing the chances of successful loan repayment for future applicants.
plot_discrete_variable(filter_data1,'loanStatus',90)
Observations : This dataset provides valuable information regarding the loan status of all applicants. It is evident that a significant portion of applicants fall into categories such as external collection, paid off loan, new loan, and internal collection, collectively constituting more than half of the applicants' statuses. Additionally, there is a noteworthy presence of applicants categorized as charged off, settlement pending paid off, and customer voided new loan.
Analysis : Understanding the diverse loan statuses among applicants is essential for assessing the overall health of the lending portfolio. Applicants falling into categories like external collection and charged off might require special attention in terms of follow-up and risk management strategies. Conversely, applicants who have successfully paid off their loans or are seeking new loans represent positive outcomes, reflecting the effectiveness of the lending process. By analyzing these different statuses, MoneyLion can tailor their approaches, focusing on improving collections for overdue accounts and fostering positive customer experiences for those successfully completing their loan agreements or seeking new opportunities. This nuanced understanding enables lenders to make informed decisions, optimizing their loan management processes and enhancing customer satisfaction.
plot_box_plot(filter_data1,'loanAmount')
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*- Mean : 656.2684431323804 Var : 246386.50836862344 Skew : 3.1998160947156817 Kurt : 12.158780653123596 -*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
Observations : This variable reveals information about the loan amounts that applicants held. Note from the boxplot above that a number of outliers exist at the upper end. We can clearly observe that this variable has very high variance, illustrating that loan amounts are widely spread out from the mean and span a large range. The skewness of 3.2 indicates a highly positive skew: the distribution of loan amounts is skewed to the right, with a long tail on the higher-amount side. This suggests that a few loans with very high amounts are pulling the distribution in the positive direction. Meanwhile, the kurtosis value of 12.16 is considered high, showing that the distribution has heavy tails and a very sharp peak compared to a normal distribution. In practical terms, a high kurtosis such as that of the loanAmount variable suggests that there are outliers in the dataset. In this case, our initial observation from the boxplot is confirmed: there are very large loan amounts that deviate significantly from the rest of the data.
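The summary statistics printed by plot_box_plot can be reproduced directly with pandas. A hedged sketch (note that pandas' `kurt` reports excess kurtosis, so a normal distribution scores 0; the sample series below is illustrative, not the real loanAmount column):

```python
import pandas as pd

def describe_spread(s: pd.Series) -> dict:
    """Mean, variance, skewness, and excess kurtosis for a numeric series."""
    return {
        "mean": s.mean(),
        "var": s.var(),    # sample variance (ddof=1)
        "skew": s.skew(),  # > 0 means a long right tail
        "kurt": s.kurt(),  # excess kurtosis; large values mean heavy tails
    }

# Illustrative only; the real call would be describe_spread(filter_data1["loanAmount"])
stats = describe_spread(pd.Series([500.0, 500.0, 800.0, 1250.0, 3000.0]))
print(stats)
```

On the sample above, the single large value (3000) already produces a positive skew, mirroring the right tail seen in loanAmount.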
plot_box_plot(filter_data1,'originallyScheduledPaymentAmount')
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*- Mean : 1804.740435849208 Var : 1726903.0102886457 Skew : 3.0576227329240933 Kurt : 12.52788999693735 -*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
Observations : Exhibiting the same characteristics as the loanAmount variable above (high variance, strong positive skew, and heavy tails), this variable reveals the originally scheduled repayment amount.
plot_discrete_variable(filter_data1,'state',90)
Observations : This variable provides insights into the geographic distribution of applicants based on their states. It is noteworthy that the majority of applicants originate from Ohio (OH), Illinois (IL), and Texas (TX), making these states the top three sources of applicants. Conversely, states like Rhode Island (RI), Alaska (AK), and North Dakota (ND) have the fewest applicants.
Analysis : Understanding this regional variation is crucial for designing an effective marketing strategy. By targeting regions with a high number of applicants, such as OH, IL, and TX, marketing efforts can be tailored to cater to the preferences and needs of the applicants in these states. Conversely, special attention can be given to regions with lower applicant numbers, like RI, AK, and ND, to explore ways to increase outreach and engagement, potentially tapping into untapped markets. This data-driven approach enables MoneyLion to optimize their marketing efforts and enhance their outreach to potential applicants across different states.
plot_discrete_variable(filter_data1,'leadType',90)
Observations : This variable provides important details about lead types, which play a pivotal role in establishing underwriting rules for leads in the lending process. The selection of lead types significantly influences the assessment and approval criteria for each lead.
Analysis : Examining the data reveals that the most common loan lead types are dominated by bvMandatory, lead, and organic. These prevalent lead types likely have specific underwriting guidelines and criteria tailored to them. On the other hand, there are only a few applicants categorized under the repeat and lionpay lead types, indicating their comparatively limited occurrence.
Understanding the distribution of lead types is crucial for MoneyLion when devising underwriting strategies. By recognizing the prevalence of certain lead types, MoneyLion can tailor their underwriting rules, risk assessment, and approval processes to align with these popular categories. Meanwhile, the presence of less common lead types highlights the need for specialized underwriting approaches for these cases.
Additionally, identifying the infrequent lead types, such as repeat and express, provides an opportunity for us to explore potential reasons behind their limited occurrence. This analysis can lead to targeted efforts in marketing, outreach, or product customization to increase engagement among applicants falling under these specific lead types. By adapting underwriting rules and strategies based on the prevalence and nature of lead types, MoneyLion can optimize their lending processes, enhance customer experiences, and ultimately improve the overall efficiency and effectiveness of our services.
plot_box_plot(filter_data1,'leadCost')
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*- Mean : 13.785954008701056 Var : 459.3500202365246 Skew : 3.572512795165249 Kurt : 19.69015916801639 -*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
Observations : The presented variable represents the cost associated with leads. On average, the cost per lead is approximately 13.79. The variance of 459.35 (a standard deviation of roughly 21.4) is large relative to the mean, indicating that lead costs vary considerably across observations.
Analysis : The average cost per lead provides a baseline for evaluating the expenses incurred in acquiring leads. The sizeable variance suggests that while many costs cluster near the mean, individual lead costs can deviate substantially from it.
However, the high positive skewness of 3.57 implies that the distribution of lead costs is skewed to the right. This indicates that there are a few instances where lead costs are substantially higher than the mean, pulling the distribution in the positive direction. Such skewness highlights the presence of outliers or exceptionally expensive leads that contribute to the overall rightward skew of the data.
Moreover, the high positive kurtosis of 19.69 indicates that the distribution of lead costs has very heavy tails and an extremely sharp peak compared to a normal distribution. This signifies that the dataset contains a substantial number of outliers with exceptionally high costs, making the distribution more peaked and the tails much heavier than a typical distribution.
Understanding these statistical properties is crucial for budgeting and financial planning. While the average cost provides a general overview, the skewness and kurtosis highlight the presence of outliers, emphasizing the importance of identifying and managing these high-cost leads. By recognizing and addressing these outliers, MoneyLion can refine their lead acquisition strategies, optimize their budget allocation, and improve the overall efficiency of their marketing efforts.
plot_discrete_variable(filter_data1,'fpStatus',0)
Observations : The provided variable represents the outcomes of the first payment made toward the loan. The data indicates that the majority of loans are checked or rejected during this initial payment assessment. However, there are specific instances classified as special cases, including skipped payments, cancelled transactions, pending statuses, and returned payments.
Analysis : Understanding the outcomes of the first loan payment is crucial for evaluating the efficiency and accuracy of the payment processing system. The prevalence of checked and rejected payments underscores the importance of rigorous validation procedures to maintain the integrity of the lending process. Identifying and rectifying the reasons for rejection can enhance the overall payment approval rates, ensuring a smoother experience for applicants.
Moreover, the presence of special cases such as skipped, cancelled, pending, and returned payments highlights the need for further investigation. These unique scenarios might arise due to specific circumstances, technical glitches, or human errors. Investigating these cases in detail allows MoneyLion to implement targeted solutions, reducing the occurrence of such exceptions in future loan transactions.
Additionally, recognizing the diversity of payment outcomes underscores the importance of having a flexible and adaptive payment processing system. By accommodating various payment statuses and addressing exceptional cases promptly, MoneyLion can further enhance customer satisfaction, streamline payment procedures, and maintain the trust of applicants throughout the loan repayment process.
From the analysis above, several data issues have been identified, including missing values, outliers, and naming convention discrepancies.
Missing Values : Missing values can occur for various reasons, such as data entry errors, system glitches, or incomplete information during data collection. When handling missing values, it's essential to assess the extent of missing data for each variable. Depending on the proportion of missing values, there are several strategies to consider:
Imputation: Missing values can be estimated or imputed using statistical methods. Common imputation techniques include mean, median, or regression imputation, where missing values are replaced with estimated values based on other available data.
Exclusion: If the missing data is limited to a small portion of the dataset, excluding those records might be a viable option. However, this should be done carefully to avoid bias in the analysis.
Domain Knowledge: In some cases, domain experts can provide insights to impute missing values accurately, especially when dealing with specific variables or fields.
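The imputation and exclusion options above can be sketched in pandas (the column name and values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative frame with two missing scores
df = pd.DataFrame({"score": [650.0, np.nan, 770.0, 835.0, np.nan]})

# Option 1: impute missing values with the median of the observed values
imputed = df["score"].fillna(df["score"].median())

# Option 2: exclude the rows with missing values instead
excluded = df.dropna(subset=["score"])

print(imputed.tolist(), len(excluded))
```

Which option is preferable depends on how much data is missing and whether the missingness itself is informative, as discussed above.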
Outliers : Outliers are data points significantly different from the rest of the dataset. They can arise due to errors, natural variations, or exceptional circumstances. Managing outliers is crucial to prevent skewed analyses and misleading conclusions. Here are common approaches to handling outliers:
Visual Inspection: Data visualization tools like box plots, histograms, or scatter plots can help identify outliers visually. Observing the data distribution can provide insights into potential outliers.
Statistical Methods: Statistical techniques like the Interquartile Range (IQR) or Z-scores can be used to identify and remove outliers based on their deviation from the mean or median.
Transformations: Applying transformations like logarithms can sometimes normalize the data and reduce the impact of outliers on statistical analyses.
Contextual Understanding: Understanding the domain and the specific variables in the dataset is crucial. Outliers might actually represent significant events or rare occurrences in certain contexts and may need to be treated differently.
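The IQR method mentioned above can be sketched as follows; the 1.5×IQR fence is the common convention, and the sample amounts are illustrative only:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Illustrative loan amounts with one extreme value
amounts = pd.Series([300, 350, 400, 500, 800, 1250, 10000])
print(amounts[iqr_outliers(amounts)].tolist())
```

Whether flagged points are then removed, capped, or kept should follow the contextual-understanding point above: an extreme loan amount may be a legitimate, informative observation.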
Naming Convention Discrepancies : It refers to inconsistencies or non-standardized ways in which variables, columns, or other identifiers are named. Such errors can occur due to various reasons, including different naming styles used by different contributors, lack of clear guidelines, or oversight during data collection or preprocessing.
Standardization: Define and enforce a consistent naming convention across all variables. This convention should cover capitalization, word separation (e.g., camelCase versus snake_case), the use of abbreviations or acronyms, and value representation (e.g., 1 versus True).
Documentation: Clearly document the chosen naming convention to ensure that all team members follow the guidelines. Include examples and explanations for clarity.
Data Cleaning: If inconsistencies exist in the dataset, consider cleaning the variable names to adhere to the standardized convention. This may involve renaming columns in databases or restructuring variable names in programming scripts.
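A cleaning pass of this kind could be sketched with `DataFrame.rename` (the inconsistent column names below are hypothetical examples, chosen to match the dataset's camelCase convention):

```python
import pandas as pd

# Hypothetical frame whose columns do not follow the agreed convention
df = pd.DataFrame({"Loan_amount": [300], "loanamount2": [400]})

# Map each variant onto the agreed convention (camelCase, matching the dataset)
rename_map = {"Loan_amount": "loanAmount", "loanamount2": "loanAmount2"}
df = df.rename(columns=rename_map)
print(list(df.columns))
```

Keeping the rename map in one place doubles as documentation of the convention, per the point above.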
Addressing these data issues requires careful consideration of the dataset's characteristics and the objectives of the analysis. Proper handling of missing values and outliers ensures the accuracy and reliability of the insights drawn from the data, providing a solid foundation for informed decision-making in the context of loans and lending processes.
C1. Missing Values Handling
percent_missing = filter_data1.isnull().sum() * 100 / len(filter_data1)
sorted_percent_missing = percent_missing.sort_values(ascending=False)
print(sorted_percent_missing)
clearfraudscore 0.285892 fpStatus 0.015538 nPaidOff 0.006215 loanStatus 0.000000 hasCF 0.000000 leadCost 0.000000 leadType 0.000000 state 0.000000 originallyScheduledPaymentAmount 0.000000 loanAmount 0.000000 loanId 0.000000 anon_ssn 0.000000 approved 0.000000 originatedDate 0.000000 originated 0.000000 applicationDate 0.000000 apr 0.000000 payFrequency 0.000000 isFunded 0.000000 dtype: float64
Observation and Analysis :
When analyzing a dataset, it's important to assess the presence of missing values in the variables, as these gaps in the data can impact the accuracy and reliability of our analysis. In the dataset above, the following variables have missing values:
clearfraudscore : This variable has the highest share of missing values among the variables, approximately 0.286%. Even a modest amount of missing data can raise concerns about the representativeness of the available records. The missing values in clearfraudscore may be due to various reasons, such as technical issues during data collection or cases where the fraud score couldn't be determined.
fpStatus : Around 0.0155% of the values in this variable are missing. While this percentage is lower than that of clearfraudscore, it's still important to investigate the missing values. The reason behind missing data in fpStatus could be specific to the nature of the variable or the data collection process.
nPaidOff : This variable has a relatively low percentage of missing values, approximately 0.0062%. Even though the percentage is small, it's essential to understand why these values are missing. Identifying the cause of missing data in nPaidOff can provide insights into the data collection procedure or potential issues with the variable itself.
Dealing with missing data involves careful consideration. Depending on the context and the extent of missing values, we might choose to handle them in various ways, such as imputation (filling in missing values with estimates), excluding rows with missing values, or considering the missingness as a separate category if it carries significance in the analysis.
Before proceeding with any analysis or modeling, it's crucial to decide on an appropriate strategy for handling missing data based on the specific requirements of our analysis and the domain knowledge. This ensures that our analysis is accurate and reliable, providing meaningful insights from the available data. Therefore, in the subsequent step, we will examine the three mentioned variables individually to identify optimal solutions for addressing the missing values.
clearfraudscore Variable
plot_variable_distribution(filter_data1, 'clearfraudscore')
Observations : We notice that the data follows a roughly normal distribution with a slight left skew, so the median is a robust measure of central tendency here. This makes the median a suitable value for imputing the missing entries. Reference Link
temp = filter_data1.copy()
temp.loc[temp['clearfraudscore'].isna(),'clearfraudscore'] = temp['clearfraudscore'].median()
fpStatus Variable
In this context, the variable fpStatus in the dataset has a very small proportion of missing values, accounting for only 0.015538% of the total data. When dealing with a small percentage of missing values like this, one approach is to simply remove the rows containing those missing values. This strategy is feasible when the missing data is negligible compared to the overall dataset size, as dropping a small fraction of rows won't significantly impact the analysis. Reference Link
temp.dropna(subset=['fpStatus'], inplace=True)
temp.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 32175 entries, 0 to 36333 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 loanId 32175 non-null object 1 anon_ssn 32175 non-null object 2 payFrequency 32175 non-null object 3 apr 32175 non-null float64 4 applicationDate 32175 non-null datetime64[ns] 5 originated 32175 non-null bool 6 originatedDate 32175 non-null datetime64[ns] 7 nPaidOff 32173 non-null float64 8 approved 32175 non-null bool 9 isFunded 32175 non-null int64 10 loanStatus 32175 non-null object 11 loanAmount 32175 non-null float64 12 originallyScheduledPaymentAmount 32175 non-null float64 13 state 32175 non-null object 14 leadType 32175 non-null object 15 leadCost 32175 non-null int64 16 fpStatus 32175 non-null object 17 hasCF 32175 non-null int64 18 clearfraudscore 32175 non-null float64 dtypes: bool(2), datetime64[ns](2), float64(5), int64(3), object(7) memory usage: 4.5+ MB
Observations : Consequently, only the nPaidOff variable still contains a few missing values.
nPaidOff Variable
From the observation above, the choice to drop the two rows with missing values for the nPaidOff variable was made to ensure data accuracy, simplify the data processing steps, and maintain the dataset's integrity for subsequent analyses.
temp.dropna(subset=['nPaidOff'], inplace=True)
temp.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 32173 entries, 0 to 36333 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 loanId 32173 non-null object 1 anon_ssn 32173 non-null object 2 payFrequency 32173 non-null object 3 apr 32173 non-null float64 4 applicationDate 32173 non-null datetime64[ns] 5 originated 32173 non-null bool 6 originatedDate 32173 non-null datetime64[ns] 7 nPaidOff 32173 non-null float64 8 approved 32173 non-null bool 9 isFunded 32173 non-null int64 10 loanStatus 32173 non-null object 11 loanAmount 32173 non-null float64 12 originallyScheduledPaymentAmount 32173 non-null float64 13 state 32173 non-null object 14 leadType 32173 non-null object 15 leadCost 32173 non-null int64 16 fpStatus 32173 non-null object 17 hasCF 32173 non-null int64 18 clearfraudscore 32173 non-null float64 dtypes: bool(2), datetime64[ns](2), float64(5), int64(3), object(7) memory usage: 4.5+ MB
In summary, after addressing missing values in all rows, the dataset now comprises a complete set of 32,173 observations.
C2. Naming Convention Discrepancies
In the dataset analysis, a significant issue regarding naming convention discrepancies has been identified. Specifically, the variables originated and approved are represented using the Boolean values True or False, while the isFunded variable uses numerical values 1 or 0.
To maintain consistency and improve data integrity, it is imperative to standardize the representation of these variables. The logical approach is to transform originated and approved variables to align with the numerical format used in isFunded.
Proposed Standardization :
originated Variable : Convert the Boolean values True / False to the numerical values 1 / 0, matching the format of isFunded.
approved Variable : Convert the Boolean values True / False to the numerical values 1 / 0, matching the format of isFunded.
By standardizing the representation of these variables, the dataset will exhibit uniformity, enhancing clarity and simplifying data processing and analysis. This standardized format will facilitate accurate comparisons, calculations, and modeling tasks, ensuring that the dataset adheres to a cohesive naming convention for these specific variables.
temp['originated'] = temp['originated'].astype(int)
temp['approved'] = temp['approved'].astype(int)
temp.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 32173 entries, 0 to 36333 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 loanId 32173 non-null object 1 anon_ssn 32173 non-null object 2 payFrequency 32173 non-null object 3 apr 32173 non-null float64 4 applicationDate 32173 non-null datetime64[ns] 5 originated 32173 non-null int64 6 originatedDate 32173 non-null datetime64[ns] 7 nPaidOff 32173 non-null float64 8 approved 32173 non-null int64 9 isFunded 32173 non-null int64 10 loanStatus 32173 non-null object 11 loanAmount 32173 non-null float64 12 originallyScheduledPaymentAmount 32173 non-null float64 13 state 32173 non-null object 14 leadType 32173 non-null object 15 leadCost 32173 non-null int64 16 fpStatus 32173 non-null object 17 hasCF 32173 non-null int64 18 clearfraudscore 32173 non-null float64 dtypes: datetime64[ns](2), float64(5), int64(5), object(7) memory usage: 4.9+ MB
temp.head(10)
| loanId | anon_ssn | payFrequency | apr | applicationDate | originated | originatedDate | nPaidOff | approved | isFunded | loanStatus | loanAmount | originallyScheduledPaymentAmount | state | leadType | leadCost | fpStatus | hasCF | clearfraudscore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LL-I-06644937 | 464f5d9ae4fa09ece4048d949191865c | B | 199.00 | 2016-01-19 22:07:36.778 | 1 | 2016-01-20 15:49:18.846 | 0.0 | 1 | 1 | Paid Off Loan | 3000.0 | 6395.19 | CA | prescreen | 0 | Checked | 1 | 879.0 |
| 1 | LL-I-17826953 | 7529d29c9d943cc55bc71a85d9c6b7a3 | W | 645.00 | 2017-03-01 15:55:41.823 | 1 | 2017-03-01 16:15:13.016 | 0.0 | 1 | 1 | New Loan | 400.0 | 1364.87 | NJ | lead | 40 | Checked | 1 | 658.0 |
| 2 | LL-I-16519347 | bc9fa73468acf18cb968a3a6a3c4697c | S | 490.00 | 2017-01-31 17:48:17.381 | 1 | 2017-01-31 18:00:14.520 | 2.0 | 1 | 1 | Paid Off Loan | 1250.0 | 3313.78 | MO | organic | 0 | Checked | 1 | 835.0 |
| 3 | LL-I-17491287 | bc9fa73468acf18cb968a3a6a3c4697c | S | 490.00 | 2017-02-21 17:20:23.671 | 1 | 2017-02-21 17:45:11.088 | 3.0 | 1 | 1 | Paid Off Loan | 1250.0 | 3112.16 | MO | organic | 0 | Checked | 1 | 835.0 |
| 4 | LL-I-15679301 | bc9fa73468acf18cb968a3a6a3c4697c | S | 490.00 | 2017-01-18 01:16:24.941 | 1 | 2017-01-18 01:30:16.712 | 1.0 | 1 | 1 | Paid Off Loan | 500.0 | 1292.09 | MO | organic | 0 | Checked | 1 | 835.0 |
| 5 | LL-I-12500318 | bc9fa73468acf18cb968a3a6a3c4697c | S | 490.00 | 2016-12-05 06:19:59.756 | 1 | 2016-12-05 06:45:04.331 | 0.0 | 1 | 1 | Paid Off Loan | 500.0 | 1271.81 | MO | bvMandatory | 6 | Checked | 1 | 835.0 |
| 6 | LL-I-11506719 | 10cefd11d41612b032fd9cd2debe32dc | B | 590.00 | 2016-08-27 09:08:05.442 | 1 | 2016-08-29 15:05:04.014 | 0.0 | 1 | 1 | Paid Off Loan | 800.0 | 2432.08 | CA | bvMandatory | 10 | Checked | 1 | 624.0 |
| 7 | LL-I-18309774 | 10cefd11d41612b032fd9cd2debe32dc | B | 159.00 | 2017-03-13 03:55:39.599 | 1 | 2017-03-13 21:15:16.293 | 1.0 | 1 | 1 | New Loan | 2600.0 | 8116.01 | CA | organic | 0 | Checked | 1 | 624.0 |
| 8 | LL-I-09279047 | 1d249e27292d25e4d3b4bc142a2a4aea | W | 404.10 | 2016-05-25 03:05:18.849 | 1 | 2016-05-25 15:11:36.260 | 3.0 | 1 | 1 | Paid Off Loan | 300.0 | 422.64 | WI | organic | 0 | Checked | 1 | 770.0 |
| 9 | LL-I-10219067 | 1d249e27292d25e4d3b4bc142a2a4aea | W | 381.65 | 2016-07-08 16:31:13.341 | 1 | 2016-07-08 21:28:07.169 | 4.0 | 1 | 1 | Paid Off Loan | 350.0 | 492.90 | WI | organic | 0 | Checked | 1 | 770.0 |
In brief, after correcting the naming convention error, a uniform and consistent naming pattern is now observed across all variables.
Proceeding with analysis, the next objective is to establish a target variable that signifies whether a specific loan application has experienced a default or not. To accomplish this, we direct our attention to the loanStatus column, which provides a broad perspective on the outcomes of each application.
Rationale
Objective for Creating a Target Variable: The primary goal at this stage of the analysis is to create a target variable that can be used for predictive modeling. This target variable will help in categorizing loan applications into two distinct groups: those that have experienced a default and those that have not.
Utilizing the loanStatus Column: The loanStatus column contains valuable information about the overall status or outcome of each loan application. It serves as a key indicator that allows us to categorize the applications into meaningful groups based on their repayment status.
Default Classification: By examining the information within the loanStatus column, we can define a criterion or rule to classify loan applications as either "defaulted" or "not defaulted", or, in numeric terms, "1" or "0". This classification is essential for various types of analyses and modeling tasks related to loan risk assessment and prediction.
Importance of Target Variable: The target variable, which signifies loan defaults, is a crucial component in the context of risk assessment and decision-making. It helps in understanding and predicting loan default patterns, which are essential for effective risk management in the financial industry.
normalized_counts = temp['loanStatus'].value_counts(normalize=True) * 100
# Display the normalized counts
print(normalized_counts)
External Collection 28.996363 Paid Off Loan 28.241072 New Loan 20.302738 Internal Collection 15.957480 Returned Item 3.266714 Settlement Paid Off 1.665993 Settled Bankruptcy 0.879620 Pending Paid Off 0.348118 Charged Off Paid Off 0.338793 Charged Off 0.003108 Name: loanStatus, dtype: float64
Analysis : Below are my own analysis and assumptions made for each unique value counts:
External Collection (29%) : In the context of loan default, this refers to the process where MoneyLion outsources the task of collecting overdue or defaulted loan payments to a third-party collection agency. For predicting the default risk / quality of applicants, this does not suit the objective, so we exclude it from our target variable consideration.
Paid Off Loan (28.24%) : This refers to a situation where an applicant has successfully repaid the entire outstanding loan amount as per the agreed terms and conditions outlined in the loan agreement. As this serves our modeling objective, we include it in our target variable consideration.
New Loan (20.30%) : This refers to a new loan application at MoneyLion. As it is not related to the default risk of applicants, since borrowing and lending have yet to take place, we exclude it from our target variable consideration.
Internal Collection (15.96%) : This refers to the process through which MoneyLion handles the collection of overdue or defaulted loan payments internally, without involving external third-party agencies. For our prediction model, this element describes how the loan is being collected; importantly, it is a method of collection rather than a default outcome. Thus, we exclude it from our target variable consideration.
Returned Item (3.27%) : This refers to loan applicants who have missed one payment (but not more) due to insufficient funds. We exclude this from our target variable consideration; the rationale is that we want the model to capture whether applicants have the financial capability to repay the loan in a timely manner.
Settlement Paid Off (1.67%) : This refers to applicants who have fully settled any outstanding loan amount. It is relevant to our target variable, as these applicants show no default risk, so we include it.
Settled Bankruptcy (0.88%) : This refers to loan applicants who have declared bankruptcy and have no financial ability to repay the loan. Such an element is directly related to our prediction model's objective, so we include it in our target variable consideration.
Pending Paid Off (0.35%) : This indicates a loan status where the applicant has fulfilled most of their payment obligations, and the loan is in the process of being paid off completely. However, some remaining payments or final steps may be required before the loan is officially marked as "Paid Off". Since we only want an accurate model of whether a loan will default, we exclude it from our target variable consideration.
Charged Off Paid Off (0.34%) : This refers to a specific scenario where a charged-off loan was subsequently paid off. For the simplicity of the model, we exclude this element.
Charged Off (0.0031%) : This refers to a status assigned by MoneyLion when the applicant has failed to make payments for an extended period and there is little expectation of recovering the full amount owed. It shows that the applicant has failed to fulfill their financial obligations, so we include it in our target variable consideration.
In short,
Elements
| Included | Excluded |
| --- | --- |
| Paid Off Loan | External Collection |
| Settlement Paid Off | New Loan |
| Settled Bankruptcy | Internal Collection |
| Charged Off | Returned Item |
| | Pending Paid Off |
| | Charged Off Paid Off |
Target Variable
| Default (1) | No Default (0) |
| --- | --- |
| Settled Bankruptcy | Paid Off Loan |
| Charged Off | Settlement Paid Off |
By employing this approach, we can generate a fresh target column in the dataset, assigning binary values to indicate the favorability of the loanStatus. A value of 1 represents a loan default outcome, while 0 signifies a non-default outcome. All other entries in the dataset will be excluded from this target column, streamlining the data for analysis.
# Map loanStatus to a binary target; statuses not listed map to NaN.
# (Assumption: these strings match the loanStatus values in the data.)
target_mapping = {
    'Paid Off Loan': 0,
    'Settlement Paid Off': 0,
    'Settled Bankruptcy': 1,
    'Charged Off': 1,
}
temp['target'] = temp['loanStatus'].map(target_mapping)
# Drop rows with no target label (statuses excluded from the mapping above)
temp = temp[temp['target'].notna()].reset_index(drop=True)
normalized_counts = temp['target'].value_counts(normalize=True) * 100
# Display the normalized counts
print(normalized_counts)
0.0    97.133051
1.0     2.866949
Name: target, dtype: float64
Observations : The target variable distribution shown above indicates a significant imbalance in the dataset. In this context, an imbalanced dataset means that the classes (0 and 1) in the target variable are not represented equally. Let's break down the provided distribution:
Class 0 (Favorable Outcome): The target value 0 accounts for approximately 97.13% of the dataset. This means the vast majority of instances belong to the class indicating a favorable outcome (a loan with no default, or another positive event).
Class 1 (Unfavorable Outcome): On the other hand, the target value 1 represents only about 2.87% of the dataset. This class signifies an unfavorable outcome (a defaulted loan, or another negative event), and it is significantly underrepresented compared to class 0.
Analysis :
Imbalanced Data: Having such a substantial disparity between the two classes (favorable and unfavorable outcomes) creates an imbalanced dataset. Imbalanced data can impact the performance of machine learning models, particularly those designed to predict rare events or minority classes. In this case, predicting class 1 accurately becomes challenging due to its scarcity in the dataset.
Challenges in Modeling: When dealing with imbalanced data, machine learning models might exhibit a bias towards the majority class. This means the model could become overly optimized for predicting the majority class (class 0), making it less effective at identifying instances of the minority class (class 1). As a result, the model's ability to recognize unfavorable outcomes (defaults) could be compromised.
Addressing Imbalance: To mitigate the effects of class imbalance, techniques like oversampling the minority class (e.g., SMOTE - Synthetic Minority Over-sampling Technique), undersampling the majority class, or using class-weighted algorithms designed for imbalanced data can be employed. These techniques help balance the class distribution and improve the model's ability to predict both classes accurately.
Careful Evaluation: When evaluating the model's performance, metrics like precision, recall, F1-score, and area under the ROC curve (AUC-ROC) are often used instead of accuracy. These metrics provide a more comprehensive understanding of the model's ability to correctly predict both classes, considering the class imbalance.
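To make the oversampling idea above concrete, here is a minimal sketch. SMOTE itself lives in the third-party imbalanced-learn (`imblearn`) package; this example shows the simpler random-oversampling variant with NumPy, on toy data (`X_demo`, `y_demo` are illustrative, not our dataset) that mirrors the 97/3 skew observed above.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y):
    """Duplicate minority-class rows at random until both classes are equally sized."""
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_needed = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=n_needed, replace=True)
    order = np.concatenate([np.arange(len(y)), extra])
    return X[order], y[order]

# Toy data: 97 majority (0) rows vs 3 minority (1) rows.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 97 + [1] * 3)
X_bal, y_bal = random_oversample(X_demo, y_demo)  # now 97 of each class
```

Unlike SMOTE, which interpolates synthetic minority samples between neighbours, random oversampling only duplicates existing rows, so it is cheaper but more prone to overfitting on the duplicated points.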
temp.drop(columns = ['loanStatus'],inplace = True)
Rationale : At this point, it's evident that we can safely remove the loanStatus column from the dataset. This decision is made possible because we have successfully extracted and condensed all the essential information from loanStatus into the newly created target variable.
E1. Drop hasCF Variable
Rationale : I have decided to exclude the hasCF column from our analysis for a couple of reasons. Firstly, the instructions provided did not offer any specific information about the content or purpose of this column. Secondly, considering the likelihood that "CF" stands for "clear fraud," it is reasonable to assume that the essential information contained in hasCF is already covered by the clearfraudscore column. Our assumption here is that the underwriting team likely rejected most fraudulent applications, making clearfraudscore a comprehensive indicator.
However, it's important to note that in a real-world scenario, it is advisable to seek clarification or confirmation before excluding any data attribute from consideration. Clarity about the data's meaning and relevance is crucial to making informed decisions during the analysis process.
temp.drop(columns = ['hasCF'],inplace = True)
temp.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9906 entries, 0 to 9905 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 loanId 9906 non-null object 1 anon_ssn 9906 non-null object 2 payFrequency 9906 non-null object 3 apr 9906 non-null float64 4 applicationDate 9906 non-null datetime64[ns] 5 originated 9906 non-null int64 6 originatedDate 9906 non-null datetime64[ns] 7 nPaidOff 9906 non-null float64 8 approved 9906 non-null int64 9 isFunded 9906 non-null int64 10 loanAmount 9906 non-null float64 11 originallyScheduledPaymentAmount 9906 non-null float64 12 state 9906 non-null object 13 leadType 9906 non-null object 14 leadCost 9906 non-null int64 15 fpStatus 9906 non-null object 16 clearfraudscore 9906 non-null float64 17 target 9906 non-null float64 dtypes: datetime64[ns](2), float64(6), int64(4), object(6) memory usage: 1.4+ MB
E2. Add durationTime Variable
Rationale : Upon conducting a comprehensive examination of each variable, a noteworthy insight emerged. Specifically, there is potential in deriving a significant feature by calculating the difference between the applicationDate and originatedDate variables. This calculated feature serves as a valuable indicator, offering insights into the duration – whether measured in hours or days – that a loan application took from the initial application stage to its origination.
Relevance as a Predictor : This time-based feature holds substantial relevance when it comes to predicting the risk profile of loan applicants. The duration from application to origination can provide crucial insights into the efficiency and promptness of the lending process. Longer durations might indicate delays or complexities in the approval process, potentially reflecting higher risks associated with the applicant or the application itself. Further evidence would need to be investigated to support this interpretation.
Potential Use in Modeling : Integrating this derived feature into predictive models can enhance their accuracy and effectiveness. By considering the time taken for loan processing as a predictor, MoneyLion can refine their risk assessment models. This, in turn, enables them to make more informed decisions, optimizing the approval process and ensuring a more streamlined and efficient lending operation.
temp['durationTime'] = (temp['originatedDate'] - temp['applicationDate']).dt.total_seconds() / 3600
temp.head(10)
| loanId | anon_ssn | payFrequency | apr | applicationDate | originated | originatedDate | nPaidOff | approved | isFunded | loanAmount | originallyScheduledPaymentAmount | state | leadType | leadCost | fpStatus | clearfraudscore | target | durationTime | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LL-I-06644937 | 464f5d9ae4fa09ece4048d949191865c | B | 199.00 | 2016-01-19 22:07:36.778 | 1 | 2016-01-20 15:49:18.846 | 0.0 | 1 | 1 | 3000.0 | 6395.19 | CA | prescreen | 0 | Checked | 879.0 | 0.0 | 17.695019 |
| 1 | LL-I-16519347 | bc9fa73468acf18cb968a3a6a3c4697c | S | 490.00 | 2017-01-31 17:48:17.381 | 1 | 2017-01-31 18:00:14.520 | 2.0 | 1 | 1 | 1250.0 | 3313.78 | MO | organic | 0 | Checked | 835.0 | 0.0 | 0.199205 |
| 2 | LL-I-17491287 | bc9fa73468acf18cb968a3a6a3c4697c | S | 490.00 | 2017-02-21 17:20:23.671 | 1 | 2017-02-21 17:45:11.088 | 3.0 | 1 | 1 | 1250.0 | 3112.16 | MO | organic | 0 | Checked | 835.0 | 0.0 | 0.413171 |
| 3 | LL-I-15679301 | bc9fa73468acf18cb968a3a6a3c4697c | S | 490.00 | 2017-01-18 01:16:24.941 | 1 | 2017-01-18 01:30:16.712 | 1.0 | 1 | 1 | 500.0 | 1292.09 | MO | organic | 0 | Checked | 835.0 | 0.0 | 0.231047 |
| 4 | LL-I-12500318 | bc9fa73468acf18cb968a3a6a3c4697c | S | 490.00 | 2016-12-05 06:19:59.756 | 1 | 2016-12-05 06:45:04.331 | 0.0 | 1 | 1 | 500.0 | 1271.81 | MO | bvMandatory | 6 | Checked | 835.0 | 0.0 | 0.417938 |
| 5 | LL-I-11506719 | 10cefd11d41612b032fd9cd2debe32dc | B | 590.00 | 2016-08-27 09:08:05.442 | 1 | 2016-08-29 15:05:04.014 | 0.0 | 1 | 1 | 800.0 | 2432.08 | CA | bvMandatory | 10 | Checked | 624.0 | 0.0 | 53.949603 |
| 6 | LL-I-09279047 | 1d249e27292d25e4d3b4bc142a2a4aea | W | 404.10 | 2016-05-25 03:05:18.849 | 1 | 2016-05-25 15:11:36.260 | 3.0 | 1 | 1 | 300.0 | 422.64 | WI | organic | 0 | Checked | 770.0 | 0.0 | 12.104836 |
| 7 | LL-I-10219067 | 1d249e27292d25e4d3b4bc142a2a4aea | W | 381.65 | 2016-07-08 16:31:13.341 | 1 | 2016-07-08 21:28:07.169 | 4.0 | 1 | 1 | 350.0 | 492.90 | WI | organic | 0 | Checked | 770.0 | 0.0 | 4.948286 |
| 8 | LL-I-07272718 | 1d249e27292d25e4d3b4bc142a2a4aea | W | 449.99 | 2016-02-16 17:21:44.967 | 1 | 2016-02-16 18:21:59.369 | 1.0 | 1 | 1 | 500.0 | 726.77 | WI | organic | 0 | Checked | 770.0 | 0.0 | 1.004001 |
| 9 | LL-I-10413611 | bc500677542783c43842d54e34830f08 | B | 324.00 | 2016-07-21 16:06:26.718 | 1 | 2016-07-21 17:10:56.462 | 1.0 | 1 | 1 | 1250.0 | 2581.47 | IL | organic | 0 | Checked | 932.0 | 0.0 | 1.074929 |
E3. Add duration_from_last Variable
Rationale : After a thorough investigation of each variable, a significant observation came to light. Specifically, there is potential in creating a meaningful feature by calculating the time difference in days between an applicant's consecutive loan applications. This duration between applications may influence the risk profile, though further analysis and supporting statistical evidence are needed to validate this. A value of "-1" is assigned to an applicant's first recorded application, since there is no earlier application to measure the gap from.
temp = temp.sort_values(by = ['anon_ssn', 'applicationDate']).copy()
temp['duration_from_last'] = temp.groupby('anon_ssn')['applicationDate'].diff().dt.total_seconds() / (24 * 60 * 60)
temp['duration_from_last'] = temp['duration_from_last'].fillna(-1)
temp
| loanId | anon_ssn | payFrequency | apr | applicationDate | originated | originatedDate | nPaidOff | approved | isFunded | loanAmount | originallyScheduledPaymentAmount | state | leadType | leadCost | fpStatus | clearfraudscore | target | durationTime | duration_from_last | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7600 | LL-I-13612582 | 00006b47f484731c95b4c11a42f9c2e3 | B | 645.0 | 2016-12-21 02:57:57.321 | 1 | 2016-12-21 03:15:05.820 | 0.0 | 1 | 1 | 500.0 | 1728.52 | KY | lead | 10 | Checked | 736.0 | 0.0 | 0.285694 | -1.000000 |
| 5320 | LL-I-10308424 | 00069c6c4fa5fa1b94b845ffde8e081b | W | 360.0 | 2016-07-15 20:19:00.936 | 1 | 2016-07-15 21:15:51.841 | 0.0 | 1 | 1 | 300.0 | 670.49 | IL | bvMandatory | 3 | Checked | 545.0 | 0.0 | 0.947474 | -1.000000 |
| 8472 | LL-I-12250877 | 000c0aa9b221e0ca8bb76dfd55d3847c | S | 590.0 | 2016-11-28 15:14:54.192 | 1 | 2016-11-28 16:34:33.602 | 1.0 | 1 | 1 | 600.0 | 1699.35 | CT | rc_returning | 0 | Checked | 872.0 | 0.0 | 1.327614 | -1.000000 |
| 4539 | LL-I-09394172 | 00166d34c1276616154d830a9ef2f620 | B | 590.0 | 2016-05-30 19:58:51.102 | 1 | 2016-05-30 21:14:33.128 | 0.0 | 1 | 1 | 700.0 | 2040.27 | NV | bvMandatory | 10 | Checked | 708.0 | 0.0 | 1.261674 | -1.000000 |
| 6501 | LL-I-05743901 | 0019a85f11c8d85805ea90b4deb461a4 | B | 360.0 | 2015-12-15 14:55:26.232 | 1 | 2015-12-15 20:19:29.015 | 0.0 | 1 | 1 | 500.0 | 996.44 | IL | bvMandatory | 6 | Checked | 818.0 | 0.0 | 5.400773 | -1.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6709 | LL-I-09345602 | ffdf8cca26a3364a57dd42c55aa88b69 | B | 324.0 | 2016-05-27 15:58:17.443 | 1 | 2016-05-30 14:06:29.465 | 1.0 | 1 | 1 | 300.0 | 600.16 | IL | organic | 0 | Checked | 837.0 | 0.0 | 70.136673 | 170.894771 |
| 6710 | LL-I-09770357 | ffdf8cca26a3364a57dd42c55aa88b69 | M | 324.0 | 2016-06-16 21:10:15.243 | 1 | 2016-06-17 14:24:32.429 | 2.0 | 1 | 1 | 1350.0 | 2691.91 | IL | organic | 0 | Rejected | 837.0 | 0.0 | 17.238107 | 20.216641 |
| 1828 | LL-I-11169363 | fff10bad3cc5bb482c50388aacb047f0 | M | 590.0 | 2016-08-16 19:01:24.719 | 1 | 2016-08-17 18:23:55.122 | 0.0 | 1 | 1 | 500.0 | 1327.28 | WI | bvMandatory | 3 | Checked | 722.0 | 0.0 | 23.375112 | -1.000000 |
| 5433 | LL-I-08701976 | fff409842c197bd4804aadba0838d90c | W | 650.0 | 2016-04-14 13:33:54.581 | 1 | 2016-04-14 21:45:44.533 | 1.0 | 1 | 1 | 400.0 | 1350.68 | MI | organic | 0 | Checked | 811.0 | 0.0 | 8.197209 | -1.000000 |
| 4700 | LL-I-18535621 | fffa10032c34c5c03e76b208dc664f70 | S | 490.0 | 2017-03-17 01:48:51.748 | 1 | 2017-03-17 02:00:09.397 | 0.0 | 1 | 1 | 500.0 | 1299.24 | MO | lead | 25 | Checked | 766.0 | 0.0 | 0.188236 | -1.000000 |
9906 rows × 20 columns
Feature selection, also known as variable selection or attribute selection, is a crucial step in data analysis and predictive modeling. It involves the automatic identification of attributes in our dataset that hold the utmost relevance in predicting the risk profile of a loan applicant. The primary objective is to discern and eliminate unnecessary, irrelevant, or redundant attributes from the dataset.
Significance of Feature Selection
Enhancing Predictive Accuracy: By selecting only the most relevant features, we optimize the accuracy and efficiency of our predictive models. Focusing on pertinent attributes ensures that the model concentrates on factors that genuinely influence the loan applicant's risk level.
Avoiding Overhead: Including redundant or irrelevant attributes can introduce noise into the model, leading to overfitting. Feature selection helps prevent this issue by filtering out unnecessary data, ensuring the model's focus is on meaningful patterns.
Simplifying Interpretation: A streamlined dataset with carefully chosen features simplifies the interpretation of the model's outcomes. Decision-makers can grasp the key factors affecting loan risk without unnecessary complexity.
Challenges and Considerations
Domain Knowledge: Effective feature selection often requires domain expertise to discern which attributes are truly relevant. Knowledge about the loan industry and the specific factors influencing loan approvals is invaluable.
Techniques and Algorithms: Various techniques, such as statistical tests, correlation analyses, or machine learning algorithms, can aid in feature selection. Choosing the appropriate method depends on the dataset's characteristics and the problem at hand.
Iterative Process: Feature selection is often an iterative process, evolving as the analysis progresses. Continuous evaluation and refinement of selected features can lead to more accurate predictive models.
Thus, a few methods are undertaken to improve the accuracy of the model.
F1. Correlation Analysis
Rationale : In the realm of data analysis, understanding correlations is vital. When we integrate highly correlated data into our model, it can lead to overfitting issues. For instance, if we encounter two features with significant correlations, we need to make a decision: exclude the one that displays stronger correlation with the remaining features. This careful selection helps our model avoid the pitfalls of overfitting, enhancing its accuracy and reliability. Note that correlation analysis only works on numerical columns, so non-numerical columns must be excluded before running it.
temp1 = temp[['apr','nPaidOff','loanAmount','originallyScheduledPaymentAmount','leadCost','clearfraudscore','target']]
HeatMap(temp1,x=True)
Observations : Upon analyzing the provided heatmap, a notable observation is the high correlation coefficient of 0.94 between two variables, specifically originallyScheduledPaymentAmount and loanAmount. This correlation value signifies a strong positive linear relationship between these two variables. In simpler terms, as one variable increases, the other variable tends to increase proportionally, and vice versa.
Analysis : The correlation coefficient of 0.94 indicates a very strong positive correlation between originallyScheduledPaymentAmount and loanAmount. This suggests that there is a robust connection between the initially scheduled payment amount and the total loan amount. Changes in one variable are highly reflective of changes in the other, indicating a predictable pattern. As for now, there is no immediate cause for concern regarding multicollinearity among these two variables. However, if our model starts exhibiting signs of generating unreliable or unstable predictions, we may investigate the impact of multicollinearity. In such a scenario, the correlated features, in this case, loanAmount and originallyScheduledPaymentAmount, will be examined more closely. If necessary, and depending on the model's performance, a decision may be made to drop one of the two correlated features. This precautionary step aims to address multicollinearity issues and enhance the overall stability and reliability of the predictive model. Regular monitoring and diagnostic checks will guide us in making informed decisions regarding feature adjustments.
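Since `HeatMap()` is a project helper, the same correlation screen can be sketched with pandas alone: compute the absolute correlation matrix and flag pairs above a chosen cutoff. The 0.9 threshold and the small `df_demo` frame below are illustrative assumptions, not values from the project.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |r|) for every column pair with |correlation| above threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 2)))
    return pairs

# Toy frame: the first two columns move almost in lockstep, the third does not.
df_demo = pd.DataFrame({
    "loanAmount": [300, 500, 700, 1250, 3000],
    "originallyScheduledPaymentAmount": [600, 1000, 1405, 2500, 6100],
    "leadCost": [0, 3, 10, 25, 6],
})
```

Running `high_corr_pairs(df_demo)` surfaces only the `loanAmount` / `originallyScheduledPaymentAmount` pair, mirroring the 0.94 relationship noted in the heatmap.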
F2. Drop applicationDate and originatedDate Variable
Rationale : We continue the feature selection process by excluding the applicationDate and originatedDate variables from our list of features. This decision is grounded in several justifications. Firstly, an examination of the data reveals no discernible or significant patterns in the application and originated dates. Secondly, there is no compelling reason to believe that these dates, in isolation, hold predictive value for loan outcomes. Lastly, we have already consolidated the relevant temporal information into durationTime and duration_from_last, making the original date features redundant in our analysis.
temp.drop(columns=['applicationDate', 'originatedDate'], inplace=True)
clean_data = temp.copy()
clean_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 9906 entries, 7600 to 4700 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 loanId 9906 non-null object 1 anon_ssn 9906 non-null object 2 payFrequency 9906 non-null object 3 apr 9906 non-null float64 4 originated 9906 non-null int64 5 nPaidOff 9906 non-null float64 6 approved 9906 non-null int64 7 isFunded 9906 non-null int64 8 loanAmount 9906 non-null float64 9 originallyScheduledPaymentAmount 9906 non-null float64 10 state 9906 non-null object 11 leadType 9906 non-null object 12 leadCost 9906 non-null int64 13 fpStatus 9906 non-null object 14 clearfraudscore 9906 non-null float64 15 target 9906 non-null float64 16 durationTime 9906 non-null float64 17 duration_from_last 9906 non-null float64 dtypes: float64(8), int64(4), object(6) memory usage: 1.4+ MB
Observations and Analysis : The summary above serves as a pre-screening stage and shows multiple columns with the 'object' data type, i.e. discrete (categorical or identifier) variables. We will handle them as follows:
loanId and anon_ssn are unique identifiers, holding a unique value per row, and require no additional action. payFrequency holds a small set of categories (W, B, S, M, T) and may be significant; it requires further attention for modeling. loanStatus was dealt with during target variable creation and requires no additional action. state contains demographic data that may affect an applicant's loan risk, so it requires further attention. leadType may potentially affect the loan risk prediction, so it also requires further attention. fpStatus contains data that won't be available at prediction time (target leakage), leading to the decision to drop this column. Reference Link
Therefore, in the subsequent step, we will examine the above-mentioned variables individually to identify suitable encodings of these discrete variables for modeling.
G1. payFrequency , state, leadType Variable
Rationale : Since payFrequency, state, and leadType are nominal categorical variables, representing unordered and mutually exclusive categories, label encoding is used to transform them into numerical representations so they can be consumed by machine learning models that accept only numerical input. Note that label encoding imposes an arbitrary ordering on the categories; tree-based models are largely insensitive to this, whereas linear models may benefit from one-hot encoding instead.
temp = clean_data.copy()
temp
| loanId | anon_ssn | payFrequency | apr | originated | nPaidOff | approved | isFunded | loanAmount | originallyScheduledPaymentAmount | state | leadType | leadCost | fpStatus | clearfraudscore | target | durationTime | duration_from_last | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7600 | LL-I-13612582 | 00006b47f484731c95b4c11a42f9c2e3 | B | 645.0 | 1 | 0.0 | 1 | 1 | 500.0 | 1728.52 | KY | lead | 10 | Checked | 736.0 | 0.0 | 0.285694 | -1.000000 |
| 5320 | LL-I-10308424 | 00069c6c4fa5fa1b94b845ffde8e081b | W | 360.0 | 1 | 0.0 | 1 | 1 | 300.0 | 670.49 | IL | bvMandatory | 3 | Checked | 545.0 | 0.0 | 0.947474 | -1.000000 |
| 8472 | LL-I-12250877 | 000c0aa9b221e0ca8bb76dfd55d3847c | S | 590.0 | 1 | 1.0 | 1 | 1 | 600.0 | 1699.35 | CT | rc_returning | 0 | Checked | 872.0 | 0.0 | 1.327614 | -1.000000 |
| 4539 | LL-I-09394172 | 00166d34c1276616154d830a9ef2f620 | B | 590.0 | 1 | 0.0 | 1 | 1 | 700.0 | 2040.27 | NV | bvMandatory | 10 | Checked | 708.0 | 0.0 | 1.261674 | -1.000000 |
| 6501 | LL-I-05743901 | 0019a85f11c8d85805ea90b4deb461a4 | B | 360.0 | 1 | 0.0 | 1 | 1 | 500.0 | 996.44 | IL | bvMandatory | 6 | Checked | 818.0 | 0.0 | 5.400773 | -1.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6709 | LL-I-09345602 | ffdf8cca26a3364a57dd42c55aa88b69 | B | 324.0 | 1 | 1.0 | 1 | 1 | 300.0 | 600.16 | IL | organic | 0 | Checked | 837.0 | 0.0 | 70.136673 | 170.894771 |
| 6710 | LL-I-09770357 | ffdf8cca26a3364a57dd42c55aa88b69 | M | 324.0 | 1 | 2.0 | 1 | 1 | 1350.0 | 2691.91 | IL | organic | 0 | Rejected | 837.0 | 0.0 | 17.238107 | 20.216641 |
| 1828 | LL-I-11169363 | fff10bad3cc5bb482c50388aacb047f0 | M | 590.0 | 1 | 0.0 | 1 | 1 | 500.0 | 1327.28 | WI | bvMandatory | 3 | Checked | 722.0 | 0.0 | 23.375112 | -1.000000 |
| 5433 | LL-I-08701976 | fff409842c197bd4804aadba0838d90c | W | 650.0 | 1 | 1.0 | 1 | 1 | 400.0 | 1350.68 | MI | organic | 0 | Checked | 811.0 | 0.0 | 8.197209 | -1.000000 |
| 4700 | LL-I-18535621 | fffa10032c34c5c03e76b208dc664f70 | S | 490.0 | 1 | 0.0 | 1 | 1 | 500.0 | 1299.24 | MO | lead | 25 | Checked | 766.0 | 0.0 | 0.188236 | -1.000000 |
9906 rows × 18 columns
from sklearn import preprocessing  # required for LabelEncoder

label_encoder = preprocessing.LabelEncoder()
temp['payFrequency'] = label_encoder.fit_transform(temp['payFrequency'])
temp['state'] = label_encoder.fit_transform(temp['state'])
temp['leadType'] = label_encoder.fit_transform(temp['leadType'])
clean_data1 = temp.copy()
clean_data1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 9906 entries, 7600 to 4700 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 loanId 9906 non-null object 1 anon_ssn 9906 non-null object 2 payFrequency 9906 non-null int64 3 apr 9906 non-null float64 4 originated 9906 non-null int64 5 nPaidOff 9906 non-null float64 6 approved 9906 non-null int64 7 isFunded 9906 non-null int64 8 loanAmount 9906 non-null float64 9 originallyScheduledPaymentAmount 9906 non-null float64 10 state 9906 non-null int64 11 leadType 9906 non-null int64 12 leadCost 9906 non-null int64 13 fpStatus 9906 non-null object 14 clearfraudscore 9906 non-null float64 15 target 9906 non-null float64 16 durationTime 9906 non-null float64 17 duration_from_last 9906 non-null float64 dtypes: float64(8), int64(7), object(3) memory usage: 1.4+ MB
Observations : As shown in the cleaned data above, the payFrequency, state, and leadType variables have been transformed into numeric representations.
G2. fpStatus Variable
Rationale : From our analysis, we understand that the fpStatus variable closely mirrors our target variable. To avoid data leakage when training the model, we decide to remove it.
temp = clean_data1.copy()
temp = temp.drop(columns = ['fpStatus'])
clean_data2 = temp.copy()
clean_data2.head(5)
| loanId | anon_ssn | payFrequency | apr | originated | nPaidOff | approved | isFunded | loanAmount | originallyScheduledPaymentAmount | state | leadType | leadCost | clearfraudscore | target | durationTime | duration_from_last | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7600 | LL-I-13612582 | 00006b47f484731c95b4c11a42f9c2e3 | 0 | 645.0 | 1 | 0.0 | 1 | 1 | 500.0 | 1728.52 | 14 | 4 | 10 | 736.0 | 0.0 | 0.285694 | -1.0 |
| 5320 | LL-I-10308424 | 00069c6c4fa5fa1b94b845ffde8e081b | 4 | 360.0 | 1 | 0.0 | 1 | 1 | 300.0 | 670.49 | 11 | 0 | 3 | 545.0 | 0.0 | 0.947474 | -1.0 |
| 8472 | LL-I-12250877 | 000c0aa9b221e0ca8bb76dfd55d3847c | 3 | 590.0 | 1 | 1.0 | 1 | 1 | 600.0 | 1699.35 | 4 | 8 | 0 | 872.0 | 0.0 | 1.327614 | -1.0 |
| 4539 | LL-I-09394172 | 00166d34c1276616154d830a9ef2f620 | 0 | 590.0 | 1 | 0.0 | 1 | 1 | 700.0 | 2040.27 | 25 | 0 | 10 | 708.0 | 0.0 | 1.261674 | -1.0 |
| 6501 | LL-I-05743901 | 0019a85f11c8d85805ea90b4deb461a4 | 0 | 360.0 | 1 | 0.0 | 1 | 1 | 500.0 | 996.44 | 11 | 0 | 6 | 818.0 | 0.0 | 5.400773 | -1.0 |
clean_data2.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 9906 entries, 7600 to 4700 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 loanId 9906 non-null object 1 anon_ssn 9906 non-null object 2 payFrequency 9906 non-null int64 3 apr 9906 non-null float64 4 originated 9906 non-null int64 5 nPaidOff 9906 non-null float64 6 approved 9906 non-null int64 7 isFunded 9906 non-null int64 8 loanAmount 9906 non-null float64 9 originallyScheduledPaymentAmount 9906 non-null float64 10 state 9906 non-null int64 11 leadType 9906 non-null int64 12 leadCost 9906 non-null int64 13 clearfraudscore 9906 non-null float64 14 target 9906 non-null float64 15 durationTime 9906 non-null float64 16 duration_from_last 9906 non-null float64 dtypes: float64(8), int64(7), object(2) memory usage: 1.4+ MB
The following provides an overview of the methodology employed during the modeling phase.
Typically, data is partitioned into a training set and a test set, with the test set left untouched until the final stage of model evaluation. Subsequently, the training set is further divided into training and validation sets, with the validation set utilized for model tuning.
The conventional train-test approach is susceptible to the high variance test problem, wherein alterations to the test set can lead to fluctuations in prediction outcomes. To mitigate this issue, we employ the k-fold validation method within our training and validation sets. This involves dividing the training set into k subsets, training the model on k-1 subsets, and validating on the remaining subset. This process is repeated k times, ensuring each subset serves as the validation set exactly once. The results are then averaged, providing a more robust evaluation that diminishes the impact of variability associated with a single train-test split.
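The k-fold mechanics described above can be sketched explicitly. In practice scikit-learn's `KFold`/`StratifiedKFold` handles this, but spelling out the index bookkeeping with plain NumPy shows why every sample serves as a validation point exactly once.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample appears in exactly one validation fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# With k=5 over 10 samples, the validation folds together cover every index once.
all_val = np.concatenate([val for _, val in kfold_indices(10, 5)])
```

Averaging a model's score over the k validation folds is what produces the Mean/Std figures reported in the spot-check results below.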
In the field of machine learning, the quest for improving models is perpetual. As data scientists and practitioners, our dedication lies in consistently refining the efficacy of our algorithms. Despite the allure of intricate and advanced models, it is crucial to recognize the effectiveness of linear algorithms. This subsection will delve into "spot-checking" algorithms and uncover their potential to yield noteworthy outcomes (Sadaf, 2023). As such, below we present our initial modeling steps and results.
Below is a brief explanation of each model used in this project:
Linear Discriminant Analysis (LDA)
Classification and Regression Tree (CART)
Naive Bayes (NB)
Support Vector Machines (SVM)
Ada Boost (AB)
Gradient Boosting Machines (GBM)
Random Forest (RF)
Extra Tree (ET)
Neural Network (NN)
These models were chosen to provide diversity and robustness in capturing patterns within the dataset, each with its unique strengths and characteristics.
temp = clean_data2.copy()
X = temp.drop(columns=['loanId','anon_ssn','target'])
Y = temp['target']
from sklearn.model_selection import train_test_split  # required import

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0, stratify=Y
)
Rationale: It's important to highlight the use of the stratify parameter, which is employed to preserve the distribution of the train and test data, ensuring a fair and representative comparison. The dataset is divided into a 75-25 split for training and testing, a common ratio that has demonstrated effective performance in various articles. It's crucial to acknowledge that the dataset characteristics in those articles might differ from ours. Reference Link
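What the stratify parameter guarantees can be illustrated with a hand-rolled split on toy labels. This is a sketch only (sklearn's `train_test_split` performs the equivalent internally); the per-class rounding is an illustrative simplification.

```python
import numpy as np

def stratified_split(y, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx) that preserve each class's proportion."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        cls_idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(len(cls_idx) * test_frac))
        test_idx.extend(cls_idx[:n_test])      # per-class share goes to test
        train_idx.extend(cls_idx[n_test:])     # remainder goes to train
    return np.array(train_idx), np.array(test_idx)

# Toy labels with the same 97:3 skew as our target variable.
y_demo = np.array([0] * 97 + [1] * 3)
tr, te = stratified_split(y_demo)  # both splits keep roughly 97%/3% class balance
```

Without stratification, a 25% test split of such skewed data could easily contain zero minority-class rows, making evaluation of class 1 impossible.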
models = GetBasedModel()
names,results,roc_result = BasedLine2(X_train, y_train,models)
LDA model: Mean (0.971329), Std (0.000614)
CART model: Mean (0.941714), Std (0.006999)
NB model: Mean (0.958541), Std (0.009684)
SVM model: Mean (0.971329), Std (0.000614)
AB model: Mean (0.971059), Std (0.000669)
GBM model: Mean (0.969713), Std (0.001829)
RF model: Mean (0.971329), Std (0.000614)
ET model: Mean (0.969040), Std (0.001702)
NN model: Mean (0.934720), Std (0.097865)
PlotBoxR().PlotResult(names,results)
basedLineScore = ScoreDataFrame(names,results,roc_result)
basedLineScore
| Model | Score | ROC_Score | |
|---|---|---|---|
| 0 | LDA | 0.9713 | 0.566087 |
| 1 | CART | 0.9417 | 0.519506 |
| 2 | NB | 0.9585 | 0.534330 |
| 3 | SVM | 0.9713 | 0.529844 |
| 4 | AB | 0.9711 | 0.596704 |
| 5 | GBM | 0.9697 | 0.588402 |
| 6 | RF | 0.9713 | 0.531140 |
| 7 | ET | 0.9690 | 0.542776 |
| 8 | NN | 0.9347 | 0.511815 |
Analysis : Here are some comments on the statistical performance of spot-check models:
Consistency vs. Discrimination
AUC Insights
Trade-Offs
Focusing on a subset of models for further training and testing can help streamline the process and concentrate on the models that have shown promise based on accuracy or ROC score. Below is a summary:
Linear Discriminant Analysis (LDA):
Support Vector Machines (SVM):
Ada Boost (AB):
Random Forest (RF):
Gradient Boosting Machines (GBM):
Continuing with these models allows for a more focused and efficient analysis, enabling a deeper exploration of their performance characteristics. It's essential to balance the trade-off between accuracy and ROC score, as both metrics provide valuable insights into model performance.
basedLineScore = basedLineScore.drop([1,2,7,8])
Rationale : The preprocessing of numerical features varies between tree-based and non-tree-based models, and the distinctions are as follows:
General Approach
Common Preprocessing Techniques
Understanding and applying these preprocessing techniques are essential steps in optimizing the performance of machine learning models, aligning with the specific requirements of tree-based or non-tree-based algorithms. The choice of preprocessing methods depends on the characteristics of the dataset and the nature of the machine learning model being employed. Thus, only MinMax Scaler and Standard Scaler will be considered to be applied into our dataset for comparison.
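As a reference for the comparison that follows, the two scalers under consideration reduce to simple formulas. This NumPy sketch (with an illustrative loan-amount vector, not project data) mirrors what sklearn's StandardScaler and MinMaxScaler compute.

```python
import numpy as np

def standard_scale(x):
    """Standardization: z = (x - mean) / std, centred at 0 with unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def minmax_scale(x):
    """Min-max scaling: x' = (x - min) / (max - min), squeezed into [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

amounts = np.array([300.0, 500.0, 700.0, 1250.0, 3000.0])
z = standard_scale(amounts)   # mean 0, std 1; outliers keep large magnitudes
m = minmax_scale(amounts)     # bounded in [0, 1]; sensitive to min/max outliers
```

In a real pipeline the scaler must be fit on the training folds only and then applied to the validation/test data, which is why `GetScaledModel` wraps scaling and model together.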
1. Standard Normalization (Standardization)
models = GetScaledModel('standard')
names,results,roc_result = BasedLine2(X_train, y_train,models)
standard LDA model: Mean (0.971329), Std (0.000614)
standard SVM model: Mean (0.971329), Std (0.000614)
standard AB model: Mean (0.971059), Std (0.000669)
standard GBM model: Mean (0.969848), Std (0.001371)
standard RF model: Mean (0.971329), Std (0.000614)
PlotBoxR().PlotResult(names,results)
StandardScore = ScoreDataFrame(names,results,roc_result)
StandardScore
| | Model | Score | ROC_Score |
|---|---|---|---|
| 0 | standard LDA | 0.9713 | 0.566087 |
| 1 | standard SVM | 0.9713 | 0.511795 |
| 2 | standard AB | 0.9711 | 0.596704 |
| 3 | standard GBM | 0.9698 | 0.588447 |
| 4 | standard RF | 0.9713 | 0.546802 |
Analysis : The table above provides insights into the performance of five different models, showcasing a remarkable consistency in accuracy. The mean accuracy across these models exhibits minimal variability, with the highest difference being 0.0013. This consistency indicates that the accuracy remains stable across different k-folds. However, it's crucial to acknowledge that the ROC score has not experienced a parallel improvement. This lack of improvement could be attributed to the imbalanced dataset, where the target variable is predominantly represented by class 0, constituting 97% of the entire dataset. The imbalance introduces a challenge in making accurate conclusions solely based on accuracy.
It's essential to recognize the potential pitfalls of relying solely on accuracy, especially in the context of imbalanced data. Wrong assumptions stemming from an imbalanced dataset can lead to inaccurate conclusions. To address this, strategies for handling imbalanced data, such as SMOTE or the use of appropriate evaluation metrics, should be considered in subsequent steps of the analysis. This will ensure a more nuanced and accurate assessment of model performance.
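To make the pitfall concrete, consider a degenerate classifier that always predicts the majority class on a 97:3 split, mirroring this dataset's imbalance. The sketch below (toy labels, not the project's data) shows it scoring ~97% accuracy while its ROC AUC is stuck at chance level:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Imbalanced labels mimicking the ~97% / ~3% split in the dataset
y_true = np.array([0] * 97 + [1] * 3)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)            # hard predictions: always 0
y_score = np.zeros(len(y_true), float)    # constant scores: no ranking ability

print(accuracy_score(y_true, y_pred))     # 0.97 — looks impressive
print(roc_auc_score(y_true, y_score))     # 0.5 — no discrimination at all
```

This is why the ROC score, not accuracy, is the more informative metric throughout this comparison.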
2. MinMax Normalization
models = GetScaledModel('minmax')
names,results,roc_result = BasedLine2(X_train, y_train,models)
minmax LDA model: Mean (0.971329), Std (0.000614)
minmax SVM model: Mean (0.971329), Std (0.000614)
minmax AB model: Mean (0.971194), Std (0.000656)
minmax GBM model: Mean (0.970387), Std (0.001342)
minmax RF model: Mean (0.971194), Std (0.000656)
PlotBoxR().PlotResult(names,results)
MinMaxScore = ScoreDataFrame(names,results,roc_result)
MinMaxScore
| | Model | Score | ROC_Score |
|---|---|---|---|
| 0 | minmax LDA | 0.9713 | 0.566087 |
| 1 | minmax SVM | 0.9713 | 0.496019 |
| 2 | minmax AB | 0.9712 | 0.603339 |
| 3 | minmax GBM | 0.9704 | 0.596057 |
| 4 | minmax RF | 0.9712 | 0.555150 |
Analysis : The table and plots above showcase the accuracy and ROC score for the dataset normalized using the Min-Max technique. A consistent pattern is evident, mirroring the observations made with the standardized dataset: in both cases the models exhibit comparable accuracy values, indicating that the choice of normalization technique (Min-Max or Standardization) has a limited impact on overall predictive performance. This consistency across normalization methods suggests that the algorithms employed are robust and relatively insensitive to the specific scaling approach used.
It's noteworthy that while accuracy remains consistent, the ROC score, a metric particularly sensitive to class imbalances, also displays similar patterns. The impact of normalization on ROC scores aligns with expectations, as both techniques aim to maintain the relative relationships between feature values.
In short, the choice between Min-Max normalization and Standardization may depend on specific considerations such as the characteristics of the dataset, the nature of the features, and the preferences of the modeling algorithms. The analysis reaffirms the importance of assessing multiple metrics and considering the specific requirements of the problem at hand when making decisions about preprocessing techniques.
print(basedLineScore.mean(numeric_only = True))
print("_____________")
print(MinMaxScore.mean(numeric_only = True))
print("_____________")
print(StandardScore.mean(numeric_only = True))
Score        0.970940
ROC_Score    0.562436
dtype: float64
_____________
Score        0.971080
ROC_Score    0.563331
dtype: float64
_____________
Score        0.970960
ROC_Score    0.561967
dtype: float64
Analysis : Thus, based on the average accuracy and ROC score, we decided to proceed with Min-Max normalization for the remaining model training and testing. Choosing Min-Max normalization maintains consistency in the preprocessing approach and evaluates the models under a single, standardized scaling method. This can be particularly beneficial for models that are sensitive to the scale of input features. It remains essential to monitor the impact of normalization on model performance, as different models may respond differently to various preprocessing techniques. Overall, this decision provides a systematic and standardized way to assess the chosen models.
3. Class Imbalance
Class imbalance refers to a situation in a classification problem where the distribution of instances across different classes is not uniform. In other words, one or more classes have significantly fewer samples than others. This imbalance can introduce challenges in training machine learning models, especially when the minority class (the class with fewer instances) is of particular interest or importance. The plot below shows a highly imbalanced class distribution in our target variable.
plot_discrete_variable(clean_data2,'target',0)
Analysis : Addressing class imbalance is crucial to ensure fair and accurate model performance, especially in situations where both classes are important. Thus, SMOTE (Synthetic Minority Oversampling TEchnique) will be explored to mitigate this issue. It is one of the most commonly used oversampling methods for the imbalance problem: it balances the class distribution by generating new synthetic minority-class examples, interpolated between existing minority samples and their nearest minority-class neighbours, rather than simply replicating them.
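Mechanically, the core SMOTE step creates a synthetic point on the line segment between a minority sample and one of its minority-class neighbours. Here is a minimal numpy sketch of that interpolation (illustrative only; the real imbalanced-learn implementation also performs nearest-neighbour search and repeats this until the classes are balanced):

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sample(x_i, x_neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    gap = rng.random()                     # random position in [0, 1)
    return x_i + gap * (x_neighbor - x_i)  # linear interpolation

# Two minority-class samples (toy 2-feature rows)
x_i = np.array([1.0, 2.0])
x_nb = np.array([3.0, 6.0])

synthetic = smote_sample(x_i, x_nb, rng)
print(synthetic)  # lies between x_i and x_nb on every feature
```

Because the synthetic points are interpolations rather than copies, the oversampled minority class covers a region of feature space instead of a handful of repeated rows.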
smote = SMOTE(sampling_strategy='auto', random_state=SEED)
temp = clean_data2.copy()
X = temp.drop(columns=['loanId','anon_ssn','target'],axis = 1)
Y = temp['target']
X_resampled, y_resampled = smote.fit_resample(X,Y)
X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(X_resampled,
y_resampled,
test_size=0.25,
random_state=0)
# Print class distribution before applying SMOTE
print("Class distribution before SMOTE:")
print("Class 0: {}".format(sum(y_train == 0)))
print("Class 1: {}".format(sum(y_train == 1)))
Class distribution before SMOTE:
Class 0: 7216
Class 1: 213
# Print class distribution after SMOTE
print("\nClass distribution after SMOTE:")
print("Class 0: {}".format(sum(y_train_resampled == 0)))
print("Class 1: {}".format(sum(y_train_resampled == 1)))
Class distribution after SMOTE:
Class 0: 7234
Class 1: 7199
Analysis : After implementing the Synthetic Minority Over-sampling Technique (SMOTE) to address the imbalanced dataset, a more equitable distribution between classes 0 and 1 is observed. This rebalancing of the dataset contributes to a more robust and fair representation of both classes, addressing the challenges posed by the initial imbalance.
Subsequently, the modeling process is continued with the application of Min-Max normalization techniques on the SMOTE-treated dataset. This combination aims to leverage the advantages of both approaches—SMOTE for handling imbalanced classes and Min-Max normalization for scaling features to a consistent range. The integration of SMOTE and Min-Max normalization is anticipated to enhance the model's performance by mitigating the impact of class imbalances and ensuring that features are uniformly scaled. This sets the stage for more reliable and accurate predictions, particularly in scenarios where imbalanced classes can significantly impact model training and evaluation.
models = GetScaledModel('minmax')
names,re_results,re_roc_result = BasedLine2(X_train_resampled, y_train_resampled,models)
minmax LDA model: Mean (0.618583), Std (0.015577)
minmax SVM model: Mean (0.753759), Std (0.005713)
minmax AB model: Mean (0.819926), Std (0.009210)
minmax GBM model: Mean (0.897318), Std (0.008443)
minmax RF model: Mean (0.975404), Std (0.002756)
PlotBoxR().PlotResult(names,re_results)
SMOTE_MinMax_baseline = ScoreDataFrame(names,re_results,re_roc_result)
SMOTE_MinMax_baseline
| | Model | Score | ROC_Score |
|---|---|---|---|
| 0 | minmax LDA | 0.6186 | 0.650455 |
| 1 | minmax SVM | 0.7538 | 0.827655 |
| 2 | minmax AB | 0.8199 | 0.908750 |
| 3 | minmax GBM | 0.8973 | 0.961710 |
| 4 | minmax RF | 0.9754 | 0.990812 |
Analysis : Below is my analysis of each model with the min-max normalization technique applied:
LDA

| | Imbalanced Dataset | SMOTE Balanced Dataset |
| --- | --- | --- |
| Accuracy | 97.13% | 61.86% |
| ROC Score | 56.61% | 65.05% |

SVM

| | Imbalanced Dataset | SMOTE Balanced Dataset |
| --- | --- | --- |
| Accuracy | 97.13% | 75.38% |
| ROC Score | 49.60% | 82.77% |

AB

| | Imbalanced Dataset | SMOTE Balanced Dataset |
| --- | --- | --- |
| Accuracy | 97.12% | 81.99% |
| ROC Score | 60.33% | 90.88% |

GBM

| | Imbalanced Dataset | SMOTE Balanced Dataset |
| --- | --- | --- |
| Accuracy | 97.04% | 89.73% |
| ROC Score | 59.61% | 96.17% |

RF

| | Imbalanced Dataset | SMOTE Balanced Dataset |
| --- | --- | --- |
| Accuracy | 97.12% | 97.54% |
| ROC Score | 55.52% | 99.08% |
Let's continue our analysis by examining what happens inside the trained RF model, using the techniques below:
1. Feature Importance : Ensemble methods such as Random Forest and Extra Trees, which consist of bagged decision trees, offer a valuable approach for gauging the significance of features in a dataset. Through the aggregation of multiple decision trees, these methods provide a robust means of estimating feature importance. By analyzing the collective behavior of the trees within the ensemble, we can discern the relative contribution of each feature towards the overall predictive performance of the model. This technique proves especially useful in identifying key variables that significantly influence the model's decision-making process, aiding in the interpretation and understanding of feature relevance in the context of the predictive task at hand.
2. SHAP Analysis : SHAP (SHapley Additive exPlanations) values are a way to explain the output of any machine learning model. It uses a game theoretic approach that measures each player's contribution to the final outcome. In machine learning, each feature is assigned an importance value representing its contribution to the model's output. SHAP values show how each feature affects each final prediction, the significance of each feature compared to others, and the model's reliance on the interaction between features.
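The game-theoretic idea behind SHAP can be shown by hand on a toy two-feature "model": a feature's Shapley value is its marginal contribution averaged over every order in which features can be added. The payoff numbers and feature names below are invented purely for illustration; real SHAP values come from the `shap` library, not this sketch.

```python
from itertools import permutations

def value(coalition):
    """Toy model output given the set of 'present' features (made-up payoffs)."""
    payoffs = {frozenset(): 0.0,
               frozenset({'apr'}): 4.0,
               frozenset({'nPaidOff'}): 2.0,
               frozenset({'apr', 'nPaidOff'}): 10.0}
    return payoffs[frozenset(coalition)]

def shapley(feature, features):
    """Average marginal contribution of `feature` over all feature orderings."""
    total = 0.0
    orders = list(permutations(features))
    for order in orders:
        before = set(order[:order.index(feature)])
        total += value(before | {feature}) - value(before)
    return total / len(orders)

features = ['apr', 'nPaidOff']
print(shapley('apr', features))       # 6.0
print(shapley('nPaidOff', features))  # 4.0
```

Note the additivity property: the two Shapley values sum to the full-coalition output (10.0), which is exactly why SHAP values decompose an individual prediction into per-feature contributions.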
temp = temp.drop(columns=['loanId','anon_ssn','target'])
plot_importance(X_train_resampled, y_train_resampled,temp)
Analysis : Overall, as depicted in the variable importance plot above, it becomes evident that the most influential factors affecting our target variable are leadType, leadCost, originallyScheduledPaymentAmount, clearfraudscore, payFrequency, and apr. These variables stand out as the top contributors (in descending order) to the predictive power of the model, indicating their significance in influencing outcomes related to the target variable.
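The importances behind such a plot come directly from the fitted ensemble. A minimal sketch of the mechanism on synthetic data (using scikit-learn's `ExtraTreesClassifier`; this stands in for, and is not, the project's `plot_importance` helper):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the loan features (5 informative of 10 columns)
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=0)

# Bagged decision trees expose impurity-based importances after fitting
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; sorting them yields the variable-importance ranking
ranked = np.argsort(model.feature_importances_)[::-1]
print(ranked[:5])  # indices of the five most influential features
```

The same `feature_importances_` attribute exists on `RandomForestClassifier`, which is why either ensemble can drive this kind of ranking.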
SHAP_functoin(X_train_resampled,X_test, y_train_resampled,y_test)
Observations : The overview plot presented above illustrates the significance of each feature within the model using SHAP techniques. The findings indicate that key factors such as apr, nPaidOff, and state emerge as pivotal contributors to the model's prediction of whether a loan will potentially default. These features play substantial roles in determining the overall predictions, emphasizing their importance in the decision-making process.
Analysis : The second SHAP plot illustrates the variables influencing only the defaulting applicants, filtered to rows with "1" in the target variable. It's evident that apr, nPaidOff, and state exhibit the highest influence in determining the default scenario. From this analysis, we have gained the following insights into loan defaulting:
apr : A negative SHAP value associated with a higher APR indicates a tendency for the predicted probability of loan default to be pushed downward. This aligns with our expectations, as applicants with higher APR rates are more likely to receive a prediction of "0," suggesting a lower predicted probability of defaulting. MoneyLion should consider controlling its APR within a range where applicants can comfortably meet their obligations.

nPaidOff : A predominantly positive SHAP value for the nPaidOff variable suggests that higher values of this variable increase the likelihood of applicants defaulting on their payments. This is an intriguing observation from the dataset's historical records: applicants with an increasing number of paid-off loans tend to have a higher probability of defaulting on their remaining payments. MoneyLion can address this by implementing targeted marketing strategies, such as promotions for the final payments, to mitigate the risk associated with this pattern.

In short, the difference in the sets of variables identified as important by the Extra Tree Classifier and by SHAP values can be attributed to the distinct methodologies these techniques employ to measure feature importance.
Extra Tree Classifier (ETC)
SHAP Values
The variation in the sets of important variables arises because ETC and SHAP interpret "importance" differently. ETC prioritizes features that aid in optimal data splitting during the tree-building process, while SHAP values provide a holistic assessment of a feature's impact by considering its contribution in various feature combinations.
In summary, the disparities in the identified important variables highlight the diversity in methodologies and underlying principles of Extra Tree Classifier and SHAP values for assessing feature importance.
In the course of this notebook's analysis, the following milestones have been accomplished:
Our project objective has been achieved by experimenting with various models. In the end, the most effective predictive model for assessing the risk profile of applicants is the Random Forest model. By incorporating Min-Max normalization and SMOTE techniques, this model achieves an impressive accuracy of 97.54% and an outstanding ROC score of 99.08%. As a result, we can conclude that our model has the capability to confidently identify and screen out potential default applicants.
Advanced Models and Hyperparameter Tuning
Imbalanced Data Techniques
Inclusion of Additional Variables
Exclusion of Variables
Feature Engineering Techniques
This marks the end of the assessment.
Total Time Spent : 45 hours - 50 hours
et = time.time()
# get the execution time
elapsed_time = et - st
print('Execution time:', (elapsed_time / 60), 'minutes')
Execution time: 14.499678719043732 minutes